Preface


An essential part of learning econometrics is the application of the methods to real-world problems 
and data. The practical implementation and application of econometric methods and tools helps 
tremendously with understanding the concepts. But learning how to use a software package also 
has great benefits in and of itself. Nowadays, a vast majority of our students will have to deal with 
some sort of data analysis in their careers. So a solid understanding of some serious data analysis 
software is an invaluable asset for any student of economics, business administration, and related 
fields. 

But what software package is the right one for learning econometrics? That’s a tough question. 
Possibly the most important aspect is that it is widely used both in and outside of academia. A 
large and active user community helps the software to remain up to date and increases the chances 
that somebody else has already solved the problem at hand. And fluency in a software package is 
especially valuable on the job market as well as on the job if it is popular. Another aspect for the 
software choice is that it is easily (and ideally freely) available to all students. 

Python is an ideal candidate for starting to learn econometrics and data analysis. It has a huge 
user base, especially in the fields of data science, machine learning, and artificial intelligence, where 
it arguably is the most popular software overall. These are very exciting areas and there is a lot 
of cutting edge research in the integration of their tools into the econometrics toolbox. So why not 
kill two birds with one stone and master a powerful and important software package while learning 
econometrics at the same time? Because Python must be hard to learn and to apply to econometrics? 
It is not at all, as this book shows. 

And Python is completely free and available for all relevant operating systems. When using it 
in econometrics courses, students can easily download a copy to their own computers and use it 
at home (or their favorite cafés) to replicate examples and work on take-home assignments. This 
hands-on experience is essential for the understanding of the econometric models and methods. It 
also prepares students to conduct their own empirical analyses for their theses, research projects, 
and professional work. 

A problem we encountered when teaching introductory econometrics classes is that the textbooks
that also introduce Python do not discuss econometrics. Conversely, our favorite introductory
econometrics textbooks do not cover Python. Although it is possible to combine a good econometrics
textbook with an unrelated introduction to Python, this creates substantial hurdles because the topics
and order of presentation are different, and the terminology and notation are inconsistent.

This book does not attempt to provide a self-contained discussion of econometric models and 
methods. Instead, it builds on the excellent and popular textbook "Introductory Econometrics" by 
Wooldridge (2019). It is compatible in terms of topics, organization, terminology, and notation, and 
is designed for a seamless transition from theory to practice. 

The first chapter provides a gentle introduction to Python, covers some of the topics of basic 
statistics and probability presented in the appendix of Wooldridge (2019), and introduces Monte 
Carlo simulation as an additional tool. The other chapters have the same names and cover the same 
material as the respective chapters in Wooldridge (2019). Assuming the reader has worked through 
the material discussed there, this book explains and demonstrates how to implement everything 
in Python and replicates many textbook examples. We also open some black boxes of the built-in
functions for estimation and inference by directly applying the formulas known from the textbook
to reproduce the results. Some supplementary analyses provide additional intuition and insights.
We want to thank Lars Grénberg for providing us with many suggestions and valuable feedback about
the contents of this book.

The book is designed mainly for students of introductory econometrics who ideally use 
Wooldridge (2019) as their main textbook. It can also be useful for readers who are familiar with 
econometrics and possibly other software packages. For them, it offers an introduction to Python 
and can be used to look up the implementation of standard econometric methods. Because we are 
explicitly building on Wooldridge (2019), it is useful to have a copy at hand while working through 
this book. 

Note that there is a sister book, Using R for Introductory Econometrics, just published as a second
edition, see http://www.URfIE.net. We based this book on the R version, using the same structure,
the same examples, and even much of the same text where it makes sense. This decision was
not made only out of laziness. It also helps readers to easily switch back and forth between the books.
And if somebody has worked through the R book, she can easily look up the pythonian way to achieve
exactly the same results and vice versa, making it especially easy to learn both languages. Which one
should you start with (given your professor hasn't made the decision for you)? Both share many
advantages, such as a huge and active user community, wide use inside and outside
of academia, and free availability. R is traditionally used in statistics, while Python is dominant
in machine learning and artificial intelligence. These origins are still somewhat reflected in the
availability of specialized extension packages. But most data analysis and econometrics tasks
can be performed equally well in both. In the end, the most important point is to get used
to the workflow of some dedicated data analysis software package instead of using no software at all
or a spreadsheet program for data analysis.

All computer code used in this book can be downloaded to make it easier to replicate the results 
and tinker with the specifications. The companion website also provides the full text of this book 
for online viewing and additional material. It is located at: 


http://www.UPfIE.net 


1. Introduction 


Learning to use Python is straightforward but not trivial. This chapter prepares us for implementing 
the actual econometric analyses discussed in the following chapters. First, we introduce the basics 
of the software system Python in Section 1.1. In order to build a solid foundation we can later rely
on, Sections 1.2 through 1.4 cover the most important concepts and approaches used in Python like
working with objects, dealing with data, and generating graphs. Sections 1.5 through 1.7 quickly
go over the most fundamental concepts in statistics and probability and show how they can be 
implemented in Python. More advanced Python topics like conditional execution, loops, functions 
and object orientation are presented in Section 1.8. They are not really necessary for most of the 
material in this book. An exception is Monte Carlo simulation which is introduced in Section 1.9. 


1.1. Getting Started 


Before we can get going, we have to find and download the relevant software, figure out how the 
examples presented in this book can be easily replicated and tinkered with, and understand the most 
basic aspects of Python. That is what this section is all about. 


1.1.1. Software 


Python is free and open source software. Its homepage is https://www.python.org/. There, a
wealth of information is available as well as the software itself. We recommend installing the Python
distribution Anaconda (also open source), which includes Python plus many tools needed for data
analysis. For more information and installation files, see https://www.anaconda.com.

Distributions are available for Windows, Mac, and Linux systems and come in two versions. The
examples in this book are based on the installation of the latest version, Python 3. It is not backwards
compatible with Python 2.


Figure 1.1. Python in the Command Line

(base) Daniels-MacBook:~ brunned$ python
Python 3.7.4 (default, Aug 13 2019, 15:17:50)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> To be, or not to be: that is the question
  File "<stdin>", line 1
    To be, or not to be: that is the question
SyntaxError: invalid syntax
>>> print('To be, or not to be: that is the question')
To be, or not to be: that is the question
>>> (1+2)*5
15
>>> 16**0.5
4.0


After downloading and installing, Python can be accessed by the command line interface. In
Windows, run the program "Anaconda Prompt". In Linux or macOS you can simply open a terminal
window. You start Python by typing python and pressing the return key. This will look similar
to the screenshot in Figure 1.1. It provides some basic information on Python and the installed
version. To the right of the ">>>" sign is the prompt where the user can type commands for Python to
evaluate.

We can type whatever we want here. After pressing the return key, the line is terminated, Python tries to
make sense out of what is written and gives an appropriate answer. In the example shown in Figure
1.1, this was done four times. The texts we typed are shown next to the ">>>" sign, Python answers
under the respective line.

Our first attempt did not work out well: We got an error message. Unfortunately, Python does
not comprehend the language of Shakespeare. We will have to adjust and learn to speak Python's
less poetic language. The second command shows one way to do this. Here, we provide the input
to the command print in the correct syntax, so Python understands that we entered text and knows
what to do with it: print it out on the console. Next, we gave Python simple computational tasks and
got the result under the respective command. The syntax should be easy to understand: apparently,
Python can do simple addition and deals with the parentheses in the expected way. The meaning of
the last command is less obvious, because it uses the pythonian way of calculating an exponential
term: 16**0.5 = $16^{0.5} = \sqrt{16} = 4$.
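
As a small illustration of the arithmetic described above, the following sketch (not one of the book's scripts; the variable names are chosen here only for illustration) shows how parentheses and the power operator ** interact with multiplication and addition:

```python
# parentheses are evaluated first, then **, then * and /, then + and -
with_parentheses = (1 + 2) * 5       # 15
without_parentheses = 1 + 2 * 5      # 11
square_root = 16 ** 0.5              # 4.0, an exponent of 0.5 is a square root
cube = 2 ** 3                        # 8

print(with_parentheses, without_parentheses, square_root, cube)
```

Note that 16 ** 0.5 returns the floating point number 4.0 rather than the integer 4.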

Python is used by typing commands such as these. Not only Apple users may be less than impressed
by the design of the user interface and the way the software is used. There are various
approaches to make it more user friendly by providing a different user interface added on top of
plain Python. Notable examples include IDLE, PyCharm, Visual Studio and Spyder. The latter was
already set up during the installation of Anaconda and we use it for everything that follows. The easiest
way to start Spyder is by selecting it in the Anaconda Navigator that was also set up during the
installation of Anaconda.

A screenshot of the user interface on a Mac computer is shown in Figure 1.2 (on other systems it 
will look very similar). There are several sub-windows. The one on the bottom right named "IPython
console" looks very similar to the command line and behaves exactly the same. The usefulness of
the other windows will become clear soon.

Here are a few quick tricks for working in the console of Spyder: 

* When starting to type a command, press the tabulator key (Tab) to see a list of suggested
commands. Typing pr, for example, followed by Tab gives a list of all Python commands
starting with pr, like the print command.

* Use help(command) to print the help page for the provided command.

* With the up and down arrow keys, we can scroll through the previously entered commands to
repeat or correct them.


1.1.2. Python Scripts 


As already seen, we will have to get used to interacting with our software using written commands. 
While this may seem odd to readers who do not have any experience with similar software at this 
point, it is actually very common for econometrics software and there are good reasons for this. An 
important advantage is that we can easily collect all commands we need for a project in a text file
called a Python script.

Figure 1.2. Spyder User Interface

A Python script contains all commands including those for reading the raw data, data manipulation,
estimation, post-estimation analyses, and the creation of graphs and tables. In a complex
project, these tasks can be divided into separate Python scripts. The point is that the script(s) together
with the raw data generate the output used in the term paper, thesis, or research paper. We can then
ask Python to evaluate all or some of the commands listed in the Python script at once.

This is important since a key feature of the scientific method is reproducibility. Our thesis adviser
as well as the referee in an academic peer review process or another researcher who wishes to build 
on our analyses must be able to fully understand where the results come from. This is easy if we can 
simply present our Python script which has all the answers. 

Working with Python scripts is not only best practice from a scientific perspective, but also very 
convenient once we get used to it. In a nontrivial data analysis project, it is very hard to remember 
all the steps involved. If we manipulate the data for example by directly changing the numbers in a 
spreadsheet, we will never be able to keep track of everything we did. Each time we make a mistake 
(which is impossible to avoid), we can simply correct the command and let Python start from scratch 
by a simple mouse click if we are using scripts. And if there is a change in the raw data set, we can 
simply rerun everything and get the updated tables and figures instantly. 

Using Python scripts is straightforward: We just write our commands into a text file and save it
with a ".py" extension. When using a user interface like Spyder, working with scripts is especially
convenient since it is equipped with a specialized editor for script files. To use the editor for working
on a new Python script, use the menu File → New file....

The window in the left part of Figure 1.2 is the script editor. We can type arbitrary text, begin
a new line with the return key, and navigate using the mouse or the arrow keys.
Our goal is not to type arbitrary text but sensible Python commands. In the editor, we can also use
tricks like code completion that work in the Console window as described above. A new command
is generally started in a new line, but a semicolon ";" can also be used if we want to cram more than
one command into one line, which is often not a good idea in terms of readability.

Comments are an extremely useful tool to make Python scripts more readable. These are lines
beginning with a "#". They are not evaluated by Python but can (and should) be used to
structure the script and explain the steps. Python scripts can be saved and opened using the File
menu.
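
To make the remarks on comments and semicolons concrete, here is a minimal sketch. It is not one of the book's scripts, and the variable names x, y, total and product are made up for this illustration:

```python
# a comment line: ignored by Python, useful for structuring a script
x = 3           # a comment can also follow a command on the same line
y = 4

# two commands crammed into one line with a semicolon (legal, but harder to read):
total = x + y; product = x * y

print(total)
print(product)
```

Writing the two assignments on separate lines gives exactly the same result and is usually easier to read.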

Figures 1.3 and 1.4 show screenshots of Spyder with a Python script saved as "First-Python-Script.py".
It consists of six lines in total including three comments. We can send lines of code to
Python to be evaluated in two different ways:


* Click the Run button (▶). The complete script is executed and only results that are explicitly printed out (by
the command print) show up in the "IPython console" window. The example in Figure 1.3
therefore only returns 15.

* Execute Python commands and scripts line by line or blockwise. The window "IPython console"
shows the command you executed and the output. Press F9 (the Spyder default) to execute the line of the
current cursor position or a highlighted block of code (highlighted with the mouse or by holding Shift
while navigating). Figure 1.4 demonstrates the execution line by line.

Figure 1.3. Executing a Script with ▶


In what follows, we will do everything using Python scripts. All these scripts are available for 
download to make it easy and convenient to reproduce all contents in real time when reading this 
book. As already mentioned, the address is 

http://www.UPfIE.net 

They are also printed in Appendix IV. In the text, we will not show screenshots, but the
script files printed in bold and (if any) Python's output in standard font. The latter only
contains output that is explicitly printed out, just like the example in Figure 1.3. Script 1.1
(First-Python-Script.py) demonstrates the way we discuss Python code in this book:¹


Script 1.1: First-Python-Script.py
# This is a comment.
# in the next line, we try to enter Shakespeare:
'To be, or not to be: that is the question'

# let's try some sensible math:
print((1 + 2) * 5)
16 ** 0.5
print('\n')


¹ To improve the readability of generated output, we will often use print commands including \n to start a new line.

Figure 1.4. Executing a Script Line by Line

Output of Script 1.1: First-Python-Script.py
15


Script 1.2 (Python-as-a-Calculator.py) is a second (and more representative) example in 
which Python is used for simple tasks any basic calculator can do. The Python script and output are: 


Script 1.2: Python-as-a-Calculator.py
result1 = 1 + 1
print(f'result1: {result1}\n')

result2 = 5 * (4 - 1) ** 2
print(f'result2: {result2}\n')

result3 = [result1, result2]
print(f'result3: \n{result3}\n')


Output of Script 1.2: Python-as-a-Calculator.py
result1: 2

result2: 45

result3:
[2, 45]


By using the function print(f'some text {variablename}') we can combine text we want
to print out with the values of certain variables. This gives clear and readable output.
We will discuss some additional hints for efficiently working with Python scripts in Section 19.
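
The following sketch illustrates this kind of formatted output a little further. It is not taken from the book; the variables average and n_obs are hypothetical, and the format specifier .3f is a standard Python feature that the text above does not discuss:

```python
average = 2 / 3
n_obs = 45

# variables (and expressions) inside {} are replaced by their values
print(f'average: {average}')

# a format specifier after a colon controls the display, here 3 decimal places
rounded_text = f'{average:.3f}'
print(f'average (rounded): {rounded_text}')

print(f'n_obs: {n_obs}, doubled: {n_obs * 2}')
```

Rounding in the output this way only changes how the number is displayed, not the value stored in the variable.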


1.1.3. Modules


Modules are Python files that contain functions and variables. You can access these modules and 
make use of their code to solve your problem. 

The standard distribution of Python already comes with a number of built-in modules. To make use
of their commands you have to import these modules first. Script 1.3 (Module-Math.py) demonstrates
this with the math module. All content of this module becomes available under the module
name, or, as in this case, an alias object we labeled someAlias.² You can choose whatever name
you want, but usually these aliases follow a naming convention. After the import, functions and
variables are accessed by the dot (.) syntax, which is related to the concept of object orientation
described in Section 1.8.4.


Script 1.3: Module-Math.py
import math as someAlias

result1 = someAlias.sqrt(16)
print(f'result1: {result1}\n')

result2 = someAlias.pi
print(f'Pi: {result2}\n')

result3 = someAlias.e
print(f'Eulers number: {result3}\n')


Output of Script 1.3: Module-Math.py
result1: 4.0

Pi: 3.141592653589793

Eulers number: 2.718281828459045
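
For comparison, Python also allows importing selected objects directly with the from command. The sketch below is not one of the book's scripts and the book does not use this style; it only shows what the alternative looks like for the same math module:

```python
# import only selected objects from the module math
from math import sqrt, pi

# the objects are now available without the module name or an alias
result1 = sqrt(16)
print(f'result1: {result1}\n')
print(f'Pi: {pi}\n')
```

The downside of this style is that it is less obvious in a long script which module a function such as sqrt comes from.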


The functionality of Python can also be extended relatively easily by advanced users. This is not 
only useful to those who are able and willing to do this, but also for a novice user who can easily 
make use of a wealth of extensions generated by a big and active community. Since these extensions 
are mostly programmed in Python, everybody can check and improve the code submitted by a user, 
so the quality control works very well. The Anaconda distribution of Python already comes with a 
number of external modules, also called packages, that we need for data analyses. 

On top of the packages that come with the standard installation or Anaconda, there are countless 
packages available for download. If they meet certain quality criteria, they can be published on 
the official "Python Package Index" (PyPI) servers at https://pypi.org/. Downloading and 
installing these packages is simple: Run your command line as explained in Section 1.1.1 and type 


pip install modulename


There are thousands of packages provided at the PyPI. Here is a list of those we will use throughout
this book with their official description:


* wooldridge: "Data sets from Introductory Econometrics: A Modern Approach (6th ed, J.M.
Wooldridge)."

* numpy: "NumPy is the fundamental package for array computing with Python."

* pandas: "Powerful data structures for data analysis, time series, and statistics."

* pandas_datareader: "Data readers extracted from the pandas codebase, should be compatible
with recent pandas versions."

* statsmodels: "Statistical computations and models for Python."

* matplotlib: "Python plotting package."

* scipy: "SciPy: Scientific Library for Python."

* patsy: "A Python package for describing statistical models and for building design matrices."

* linearmodels: "Instrumental Variable and Linear Panel models for Python."


Of course, the installation only has to be done once per computer/user and needs an active internet
connection.

² You can also directly use objects from modules without referencing the module name or its alias by using the command
from. We will not use this way of importing, but sometimes it might be more convenient.


1.1.4. File Names and the Working Directory 


There are several possibilities for Python to interact with files. The most important ones are to 
import or export a data file. We might also want to save a generated figure as a graphics file or store 
regression tables as text, spreadsheet, or LaTeX files.

Whenever we provide Python with a file name, it can include the full path on the computer. The 
full (i.e. "absolute") path to a script file might be something like 


/Users/MyPyProject/MyScript.py 


on a Mac or Linux system. The path is provided for Unix based operating systems using forward 
slashes. If you are a Windows user, you usually use back slashes instead of forward slashes, but the 
Unix-style will also work in Python. On a Windows system, a valid path would be 


c:/Users/MyUserName/Documents/MyPyProject/MyScript.py


If we do not provide any path, Python will use the current "working directory" for reading or
writing files. After importing the module os, it can be obtained by the command os.getcwd(). To
change the working directory, use the command os.chdir(path). Relative paths are interpreted
relative to the current working directory. For a neat file organization, best practice is to generate a
directory for each project (say MyPyProject) with several sub-directories (say PyScripts, data,
and figures). At the beginning of our script, we can use os.chdir('/Users/MyPyProject')
and afterwards refer to a data set in the respective sub-directory as data/MyData.csv and to a
graphics file as figures/MyFigure.png.³
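
The interplay of working directory and relative paths can be tried out with the following sketch. It is an illustration under stated assumptions rather than one of the book's scripts: a temporary directory created with the standard module tempfile stands in for a project folder such as MyPyProject, so that the example runs on any computer without leaving files behind.

```python
import os
import tempfile

# remember where we started so we can switch back at the end
original_dir = os.getcwd()

# a temporary directory stands in for a project folder like MyPyProject
with tempfile.TemporaryDirectory() as project_dir:
    os.makedirs(os.path.join(project_dir, 'data'))
    os.chdir(project_dir)
    print(f'working directory: {os.getcwd()}')

    # relative path: interpreted relative to the working directory
    with open('data/MyData.csv', 'w') as f:
        f.write('x,y\n1,2\n')

    # the same file addressed by its absolute path
    absolute_path = os.path.join(project_dir, 'data', 'MyData.csv')
    file_exists = os.path.exists(absolute_path)
    print(f'file exists: {file_exists}')

    # leave the directory before it is removed
    os.chdir(original_dir)
```

In a real project we would of course call os.chdir with the path of our own project directory instead.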


1.1.5. Errors and Warnings 


Something you will experience very soon when starting to work with Python (or any other similar 
software package) is that you will make mistakes. The main difference to learning to ride a bicycle 
is that when learning to use Python, mistakes will not hurt. Another difference is that even people 
who have been using Python for years make mistakes all the time. 

Many mistakes will cause Python to complain in the form of error messages or warnings. An 
important part of learning Python is to roughly get an idea of what went wrong from these messages. 
Here is a list of frequent error messages and warnings you might get: 


³ For working with data sets, see Section 1.3.

* NameError: name 'x' is not defined: We have tried to use a variable x that isn't
defined (yet). Could also be due to a typo in the variable name.

* FileNotFoundError: [Errno 2] No such file or directory: 'data.csv':
Python wasn't able to open the file. Check the working directory, path, file name.

* ModuleNotFoundError: No module named 'xyz': We mistyped the module name. Or
the required module is not installed on the computer. In this case, install it as described in
Section 1.1.3.
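
To see these three messages without interrupting a script, we can provoke each error on purpose and catch it. This is only a sketch: it uses the try/except construct, which is standard Python but not introduced in this section, and the variable, file and module names are made up and assumed not to exist.

```python
# each block triggers one of the errors on purpose and catches it

try:
    print(undefined_variable)
except NameError as err:
    name_msg = str(err)
    print(f'NameError: {name_msg}')

try:
    open('no_such_file_for_this_example.csv')
except FileNotFoundError as err:
    file_msg = str(err)
    print(f'FileNotFoundError: {file_msg}')

try:
    import no_such_module_xyz
except ModuleNotFoundError as err:
    module_msg = str(err)
    print(f'ModuleNotFoundError: {module_msg}')
```

Without try/except, Python would stop at the first error and print the message together with the line at which it occurred.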


There are countless other error messages and warnings you may encounter. Some of them are easy 
to interpret, but others might require more investigative prowess. Often, the search engine of your 
choice will be helpful. 


1.1.6. Other Resources 


There are many useful resources helping to learn and use Python. Useful books on Python in general 
include Downey (2015), Matthes (2015), Barry (2016) and many others. Oliphant (2007) introduces 
Python for scientific computing and Guido and Mueller (2016) narrow it down to data science. 

Since Python has a very active user community, there is also a wealth of information available for 
free on the internet. Here are some suggestions: 


* The official Python Tutorial:
https://docs.python.org/3/tutorial/index.html

* Additional links to external resources like tutorials and books:
https://wiki.python.org/moin/BeginnersGuide

* The links to module documentations available at the Python Package Index:
https://pypi.org

* Quantitative economic modeling with Python:
https://python.quantecon.org/

* Stack Overflow: A general discussion forum for programmers, including many Python users:
https://stackoverflow.com

* Cross Validated: Discussion forum on statistics and data analysis with an active Python community:
https://stats.stackexchange.com


1.2. Objects in Python


Python can work with numbers, lists, arrays, texts, data sets, graphs, functions, and many more 
objects of different types. This section covers the most important ones we will frequently encounter 
in the remainder of this book. We will first introduce built-in objects that are available with the 
standard distribution of Python. In the second part we cover objects included in the modules numpy 
and pandas. 


1.2.1. Variables 


We have already observed Python doing some basic arithmetic calculations. From Script 1.2
(Python-as-a-Calculator.py), the general approach of Python should be self-explanatory.
Fundamental operators include +, -, *, / for the respective arithmetic operations and parentheses ( and
) that work as expected.

We will often want to store results of calculations to reuse them later. For this, we can assign any 
result to a variable. A variable has a name and by this name you can access the assigned object. We 
can freely choose the variable name given certain rules: they have to start with a (small or capital)
letter or the underscore character "_" and include only letters, numbers, and underscores. Python is
case sensitive, so x and X are different variables.

You already saw how variables are used to reference objects in
Script 1.2 (Python-as-a-Calculator.py): The content of an object is assigned using =. In order
to assign the result of 1 + 1 to the variable result1, type result1 = 1 + 1.

A new object is created, which includes the value 2. After assigning it to result1, we can use 
result1 in our calculations. If there was a variable with this name before, its content is overwritten. 

A list of all currently defined variable names is shown in the “Variable explorer” window in 
Spyder, see Figure 1.3 (top right by default). You can also use the command dir to do this. Removing 
a previously defined variable (for example x) from the workspace is done using del x. 
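
The rules for variables described in this section can be checked with a few lines. The sketch below is not one of the book's scripts; the names x and X are chosen only to show case sensitivity, overwriting, dir and del:

```python
x = 5
X = 10          # a different variable: Python is case sensitive
x = x + 1       # the old content of x is overwritten
print(x, X)

# dir() returns the names defined in the current scope
print('x' in dir())

del x           # remove x from the workspace
x_still_defined = 'x' in dir()
print(x_still_defined)
```

After del x, any attempt to use x again results in a NameError as described in Section 1.1.5.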

Up to now, we assigned results of arithmetic operations to variables. In the next sections, we will 
introduce more complex types of objects like texts, arrays, lists, data sets, function definitions, and 
estimation results. 


1.2.2. Objects in Python 


You might wonder what kind of objects we have dealt with so far. Script 1.4 (Objects-in-Python.py)
shows how to figure this out by using the command type:


Script 1.4: Objects-in-Python.py
result1 = 1 + 1

# determine the type:
type_result1 = type(result1)

# print the result:
print(f'type_result1: {type_result1}')

result2 = 2.5
type_result2 = type(result2)
print(f'type_result2: {type_result2}')

result3 = 'To be, or not to be: that is the question'
type_result3 = type(result3)
print(f'type_result3: {type_result3}\n')


Output of Script 1.4: Objects-in-Python.py
type_result1: <class 'int'>
type_result2: <class 'float'>
type_result3: <class 'str'>


Table 1.1. Logical Operators

x == y   x is equal to y                    x != y    x is NOT equal to y
x < y    x is less than y                   not b     NOT b (i.e. True, if b is False)
x <= y   x is less than or equal to y       a or b    Either a or b is True (or both)
x > y    x is greater than y                a and b   Both a and b are True
x >= y   x is greater than or equal to y


The command type tells us that we have created integers (int), floating point numbers (float) 
and text objects (str). The data type not only defines what values can be stored, but also the actions 
you can perform on these objects. For example, if you want to add an integer to result3, Python 
will return: 


TypeError: can only concatenate str (not "int") to str 
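
The following sketch reproduces this situation and shows one possible way around it. It is an illustration only: try/except is standard Python that this section does not cover, and converting the integer with str() is just one option, chosen here as an assumption about what the user wants to achieve.

```python
result3 = 'To be, or not to be: that is the question'

# adding an integer to a text object is not defined
try:
    result3 + 1
except TypeError as err:
    error_message = str(err)
    print(f'TypeError: {error_message}')

# converting the integer to text first makes the operation well defined
combined = result3 + ' ' + str(1)
print(combined)
```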


Scalar data types like int, float or str contain only one single value. A Boolean value, also 
called logical value, is another scalar data type that will become useful if you want to execute code 
only if one or more conditions are met. An object of type bool can only take one of two values: 
True or False. The easiest way to generate them is to state claims which are either true or false 
and let Python decide. Table 1.1 lists the main logical operators. 

As we saw in previous examples, scalar types differ in what kind of data they can be used for: 


* int: whole numbers, for example 2 or 5 

* float: numbers with a decimal point, for example 2.0 or 4.95 

* str: any sequence of characters delimited by either single or double quotes, for example 'ab'
or "abc"

* bool: either True or False 
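
As a brief illustration of Boolean values and the operators in Table 1.1, consider the following sketch. It is not one of the book's scripts, and the variable names are hypothetical:

```python
x = 2
y = 5.0

# state claims and let Python decide whether they are true
is_equal = (x == y)
is_less = (x < y)
print(f'is_equal: {is_equal}')
print(f'is_less: {is_less}')
print(f'type(is_less): {type(is_less)}')

# combine logical values with not, or, and
print(not is_equal)
print(is_equal or is_less)
print(is_equal and is_less)
```

Note that an int and a float can be compared directly, so the claims above are well defined even though x and y have different types.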


For statistical calculations, we obviously need to work with data sets including many numbers or
texts instead of scalars. The simplest way we can collect components (even components of different
types) is called a list in Python terminology. To define a list, we can collect different values
using [value1, value2, ...]. You can access a list entry by providing the position (starting at
0) within square brackets next to the variable name referencing the list (see Script 1.6 (Lists.py)
for an example). You can also access a range of values by using their start position i and end position
j with the syntax listname[i:(j+1)].
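
Because the end of a range is easy to get wrong, here is a short sketch of the indexing rules. The list letters and the positions i and j are made up for this illustration; the negative index in the last line is standard Python that the text above does not mention.

```python
letters = ['a', 'b', 'c', 'd', 'e']

# positions start at 0, so this is the second entry
second = letters[1]

# entries at positions 1 through 3: the end of the slice (4) is excluded
i = 1
j = 3
middle = letters[i:(j + 1)]

print(second)
print(middle)

# negative positions count from the end of the list
print(letters[-1])
```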

There are two types of actions you can do with lists (or other objects): apply a function or a 
method. We will go into details in Section 1.8.4, and here just demonstrate the different syntax of 
function and method calls. The examples in Script 1.6 (Lists.py) should help to understand the 
concept and use of a list. Script 1.5 (Lists-Copy.py) in the appendix demonstrates how to work
with a copy of a list. By default, you will not work on a copy when assigning a list to another variable,
but on the underlying object. For a list, use [:] to create a copy.
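
The difference between a second name for the same list and a real copy can be seen in a minimal sketch. This is not the book's Script 1.5, which is printed in the appendix; the variable names below are chosen only for this illustration.

```python
original = [1, 5, 41.3, 2.0]

same_object = original      # a second name for the same list
real_copy = original[:]     # a new list with the same entries

# changing the original also changes same_object, but not real_copy
original[0] = 99

print(f'same_object: {same_object}')
print(f'real_copy: {real_copy}')
```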

Script 1.6: Lists.py
# define a list:
example_list = [1, 5, 41.3, 2.0]
print(f'type(example_list): {type(example_list)}\n')

# access first entry by index:
first_entry = example_list[0]
print(f'first_entry: {first_entry}\n')

# access second to fourth entry by index:
range2to4 = example_list[1:4]
print(f'range2to4: {range2to4}\n')

# replace third entry by new value:
example_list[2] = 3
print(f'example_list: {example_list}\n')

# apply a function:
function_output = min(example_list)
print(f'function_output: {function_output}\n')

# apply a method:
example_list.sort()
print(f'example_list: {example_list}\n')

# delete third element of sorted list:
del example_list[2]
print(f'example_list: {example_list}\n')


Output of Script 1.6: Lists.py 
type(example_list): <class 'list'> 

first_entry: 1 

range2to4: [5, 41.3, 2.0] 

example_list: [1, 5, 3, 2.0] 

function_output: 1 

example_list: [1, 2.0, 3, 5] 

example_list: [1, 2.0, 5] 


A key characteristic of a list is the order of included components. This order allows you to access 
its components by position. Dictionaries (dict) do not rely on positions. Instead, you access 
components by their unique keys. Script 1.8 (Dicts.py) demonstrates their definition and some 
basic operations. Working on a copy is demonstrated in the appendix (Script 1.7 (Dicts-Copy.py)). 
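
As a minimal sketch of key-based access and of working on a copy, consider the following lines. The names are hypothetical, and the use of copy.deepcopy is our own choice for illustration; the appendix script may proceed differently:

```python
import copy

scores = {'name': ['Florian', 'Daniel'], 'points': [96, 49]}

# assignment creates a second reference, not a copy:
alias = scores

# deepcopy also copies the lists that are stored as values:
scores_copy = copy.deepcopy(scores)

# a change made through the alias is visible in 'scores' ...
alias['points'][1] = 53
print(f"scores['points']: {scores['points']}\n")

# ... but not in the independent copy:
print(f"scores_copy['points']: {scores_copy['points']}\n")

# components are addressed by their keys, not by a position:
keys = list(scores.keys())
print(f'keys: {keys}\n')
```

A deep copy is used here because the values of the dict are lists themselves, so a copy of the outer dict alone would still share these lists.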
Script 1.8: Dicts.py 
# define and print a dict: 
var1 = ['Florian', 'Daniel'] 
var2 = [96, 49] 
var3 = [True, False] 
example_dict = dict(name=var1, points=var2, passed=var3) 
print(f'example_dict: \n{example_dict}\n') 

# another way to define the dict: 
example_dict2 = {'name': var1, 'points': var2, 'passed': var3} 
print(f'example_dict2: \n{example_dict2}\n') 

# get data type: 
print(f'type(example_dict): {type(example_dict)}\n') 

# access 'points': 
points_all = example_dict['points'] 
print(f'points_all: {points_all}\n') 

# access 'points' of Daniel: 
points_daniel = example_dict['points'][1] 
print(f'points_daniel: {points_daniel}\n') 

# add 4 to 'points' of Daniel and let him pass: 
example_dict['points'][1] = example_dict['points'][1] + 4 
example_dict['passed'][1] = True 
print(f'example_dict: \n{example_dict}\n') 

# add a new variable 'grade': 
example_dict['grade'] = [1.3, 4.0] 

# delete variable 'points': 
del example_dict['points'] 
print(f'example_dict: \n{example_dict}\n') 


Output of Script 1.8: Dicts.py 
example_dict: 
{'name': ['Florian', 'Daniel'], 'points': [96, 49], 'passed': [True, False]} 

example_dict2: 
{'name': ['Florian', 'Daniel'], 'points': [96, 49], 'passed': [True, False]} 

type(example_dict): <class 'dict'> 

points_all: [96, 49] 

points_daniel: 49 

example_dict: 
{'name': ['Florian', 'Daniel'], 'points': [96, 53], 'passed': [True, True]} 

example_dict: 
{'name': ['Florian', 'Daniel'], 'passed': [True, True], 'grade': [1.3, 4.0]} 


There are many more important data types and we covered only the ones relevant for this book. 
Table 1.2 summarizes these built-in data types plus a simple example in case you have to look them 
up later. 
Table 1.2. Python Built-in Data Types 

Python type   Data Type               Example 
int           Integer                 a = 5 
float         Floating Point Number   a = 5.3 
str           String                  a = 'abc' 
bool          Boolean                 a = True 
list          List                    a = [1, 3, 1.5] 
dict          Dict                    a = {'b': [1, 2], 'c': [5, 3]} 


1.2.3. Objects in numpy 


Before you start working with numpy, make sure that you have the Anaconda distribution or install 
numpy as explained in Section 1.1.3. For more information about the module, see Walt, Colbert, 
and Varoquaux (2011). It is standard to import the module under the alias np when working with 
numpy, so the first line of code always is: 


import numpy as np 


The most important data type in numpy is the multidimensional array (ndarray). We will first 
introduce the definition of this data type as well as the basics of accessing and manipulating arrays. 
Second, we will demonstrate functions and methods that become useful when working on 
econometric problems. 

To create a simple array, provide a list to the function np.array. You can also create a two-dimensional 
array by providing multiple lists within square brackets.⁴ Instead of a two-dimensional 
array, we will often call this data type a matrix. Matrices are important tools for econometric analyses. 
Appendix D of Wooldridge (2019) introduces the basic concepts of matrix algebra.⁵ 

The syntax for defining a numpy array is: 


testarray1D = np.array(list) 
testarray2D = np.array([list1, list2, list3]) 


Within a provided list, the numpy array requires a homogeneous data type. If you enter lists 
including elements of different type, numpy will convert them to a homogeneous data type (for 
example, np.array( ['a', 2] ), becomes an array of strings). 
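
The conversion to a homogeneous data type can be inspected with the dtype attribute of an array. The following lines are a small sketch with variable names of our own choosing:

```python
import numpy as np

# integers and floats are converted to floats:
mixed_numbers = np.array([1, 2.5])
print(f'mixed_numbers.dtype: {mixed_numbers.dtype}\n')

# a string and an integer are converted to strings:
mixed_text = np.array(['a', 2])
print(f'mixed_text: {mixed_text}\n')

# the kind 'U' indicates a (unicode) string data type:
print(f'mixed_text.dtype.kind: {mixed_text.dtype.kind}\n')
```

Note that the integer 2 is stored as the string '2' in the second array, so numerical operations on it are no longer possible without converting it back.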

Indexing one-dimensional arrays is similar to the procedure with the data type list. Two-dimensional 
arrays are accessed by two comma-separated values within the square brackets. The 
first number gives the row, the second number gives the column (starting at 0 for the first row or 
column). Just as with a list, accessing ranges of values with ":" excludes the upper limit. There 
are a lot more possibilities and Script 1.9 (Numpy-Arrays.py) demonstrates some of them. 


⁴ You can use higher dimensional arrays by typing more square brackets, but we will not need more than two dimensions in 
what follows. 
⁵ The stripped-down European and African textbook Wooldridge (2014) does not include the Appendix on matrix algebra. 
Script 1.9: Numpy-Arrays.py 
import numpy as np 

# define arrays in numpy: 
testarray1D = np.array([1, 5, 41.3, 2.0]) 
print(f'type(testarray1D): {type(testarray1D)}\n') 

testarray2D = np.array([[4, 9, 8, 3], 
                        [2, 6, 3, 2], 
                        [1, 1, 7, 4]]) 

# get dimensions of testarray2D: 
dim = testarray2D.shape 
print(f'dim: {dim}\n') 

# access elements by indices: 
third_elem = testarray1D[2] 
print(f'third_elem: {third_elem}\n') 

second_third_elem = testarray2D[1, 2]  # element in 2nd row and 3rd column 
print(f'second_third_elem: {second_third_elem}\n') 

second_to_third_col = testarray2D[:, 1:3]  # each row in the 2nd and 3rd column 
print(f'second_to_third_col: \n{second_to_third_col}\n') 

# access elements by lists: 
first_third_elem = testarray1D[[0, 2]] 
print(f'first_third_elem: {first_third_elem}\n') 

# same with Boolean lists: 
first_third_elem2 = testarray1D[[True, False, True, False]] 
print(f'first_third_elem2: {first_third_elem2}\n') 

k = np.array([[True, False, False, False], 
              [False, False, True, False], 
              [True, False, True, False]]) 
elem_by_index = testarray2D[k]  # 1st elem in 1st row, 3rd elem in 2nd row... 
print(f'elem_by_index: {elem_by_index}\n') 


Output of Script 1.9: Numpy-Arrays.py 
type(testarray1D): <class 'numpy.ndarray'> 

dim: (3, 4) 

third_elem: 41.3 

second_third_elem: 3 

second_to_third_col: 
[[9 8] 
 [6 3] 
 [1 7]] 

first_third_elem: [ 1.  41.3] 

first_third_elem2: [ 1.  41.3] 

elem_by_index: [4 3 1 7] 
numpy also provides some predefined and useful special cases of one- and two-dimensional arrays. We 
show some of them in Script 1.10 (Numpy-SpecialCases.py). 


Script 1.10: Numpy-SpecialCases.py 
import numpy as np 

# array of evenly spaced numbers defined by the arguments start, end and sequence length: 
sequence = np.linspace(0, 2, num=11) 
print(f'sequence: \n{sequence}\n') 

# sequence of integers starting at 0, ending at 5-1: 
sequence_int = np.arange(5) 
print(f'sequence_int: \n{sequence_int}\n') 

# initialize array with each element set to zero: 
zero_array = np.zeros((4, 3)) 
print(f'zero_array: \n{zero_array}\n') 

# initialize array with each element set to one: 
one_array = np.ones((2, 5)) 
print(f'one_array: \n{one_array}\n') 

# uninitialized array (filled with arbitrary nonsense elements): 
empty_array = np.empty((2, 3)) 
print(f'empty_array: \n{empty_array}\n') 


Output of Script 1.10: Numpy-SpecialCases.py 
sequence: 
[0.  0.2 0.4 0.6 0.8 1.  1.2 1.4 1.6 1.8 2. ] 

sequence_int: 
[0 1 2 3 4] 

zero_array: 
[[0. 0. 0.] 
 [0. 0. 0.] 
 [0. 0. 0.] 
 [0. 0. 0.]] 

one_array: 
[[1. 1. 1. 1. 1.] 
 [1. 1. 1. 1. 1.]] 

empty_array: 
[[-1.72723371e-077 -1.72723371e-077 -1.73060214e-077] 
 [-1.49457718e-154  2.24183079e-314  4.17201348e-309]] 


Table 1.3 lists important functions and methods in numpy. We can apply them to the data type 
ndarray, but they usually work for many built-in types too. Functions are often vectorized, meaning 
that they are applied to each of the elements separately (in a very efficient way). Methods on an 
object referenced by x are invoked by using the x.somemethod() syntax discussed above. Script 
1.11 (Numpy-Operations.py) provides examples to see them in action. We will see in Section 1.5 
how to obtain descriptive statistics with numpy. 
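
To make the idea of vectorization concrete, the following sketch (with hypothetical variable names) compares a vectorized function call with an explicit loop over the elements and adds a method call for contrast:

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0])

# vectorized function: applied to each element in a single call
roots = np.sqrt(x)
print(f'roots: {roots}\n')

# equivalent, but longer and slower for large arrays: an explicit loop
roots_loop = np.array([value ** 0.5 for value in x])
print(f'roots_loop: {roots_loop}\n')

# a method is invoked on the object itself:
total = x.sum()
print(f'total: {total}\n')
```

Both approaches give the same numbers. The vectorized call is shorter and, for large arrays, typically much faster.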
Table 1.3. Important numpy Functions and Methods 

add(x, y) or x+y         Element-wise sum of all elements in x and y 
subtract(x, y) or x-y    Element-wise subtraction of all elements in x and y 
divide(x, y) or x/y      Element-wise division of all elements in x and y 
multiply(x, y) or x*y    Element-wise multiplication of all elements in x and y 
exp(x)                   Element-wise exponential of all elements in x 
sqrt(x)                  Element-wise square root of all elements in x 
log(x)                   Element-wise natural logarithm of all elements in x 
linalg.inv(x)            Inverse of x 
x.sum()                  Sum of all elements in x 
x.min()                  Minimum of all elements in x 
x.max()                  Maximum of all elements in x 
x.dot(y) or x@y          Matrix multiplication of x and y 
x.transpose() or x.T     Transpose of x 


numpy has a powerful matrix algebra system. Basic matrix algebra includes: 


* Matrix addition using the operator + as long as the matrices have the same dimensions. 
* The operator * does not do matrix multiplication but rather element-wise multiplication. 
* Matrix multiplication is done with the operator @ (or the dot method) as long as the dimensions 
  of the matrices match. 
* Transpose of a matrix X: as X.T 
* Inverse of a matrix X: as linalg.inv(X) 
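
The difference between the operators * and @ is easy to overlook. The following sketch uses two small hypothetical matrices to contrast them and checks that a matrix times its inverse gives the identity matrix (up to rounding error):

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 3.0]])
Y = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# element-wise multiplication:
elementwise = X * Y
print(f'elementwise: \n{elementwise}\n')

# matrix multiplication:
matprod = X @ Y
print(f'matprod: \n{matprod}\n')

# inverse of X and a check that X @ X_inv is the identity matrix:
X_inv = np.linalg.inv(X)
check = X @ X_inv
print(f'check: \n{check}\n')
```

Here Y swaps the two columns of X under matrix multiplication, whereas the element-wise product simply sets the diagonal elements to zero.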


The examples in Script 1.11 (Numpy-Operations.py) should help to understand the workings 
of these basic operations. In order to see how the OLS estimator for the multiple regression model 
can be calculated using matrix algebra, see Section 3.2. 
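
As a brief preview, the standard textbook formula for the OLS estimator, b = (X'X)^(-1) X'y, can be written with the operations introduced above. This is only a sketch with made-up data in which y = 1 + 2x holds exactly; it is not the code used in Section 3.2:

```python
import numpy as np

# hypothetical data for which y = 1 + 2*x holds exactly:
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1 + 2 * x

# regressor matrix with a column of ones for the intercept:
X = np.column_stack((np.ones(len(x)), x))

# OLS formula b = (X'X)^(-1) X'y:
b = np.linalg.inv(X.T @ X) @ X.T @ y
print(f'b: {b}\n')
```

Since the data lie exactly on a line, the estimated intercept and slope reproduce the values 1 and 2 up to rounding error.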


Script 1.11: Numpy-Operations.py 
import numpy as np 

# define arrays in numpy: 
mat1 = np.array([[4, 9, 8], 
                 [2, 6, 3]]) 
mat2 = np.array([[1, 5, 2], 
                 [6, 6, 0], 
                 [4, 8, 3]]) 

# use a numpy function: 
result1 = np.exp(mat1) 
print(f'result1: \n{result1}\n') 

result2 = mat1 + mat2[[0, 1]]  # same as np.add(mat1, mat2[[0, 1]]) 
print(f'result2: \n{result2}\n') 

# use a method: 
mat1_tr = mat1.transpose() 
print(f'mat1_tr: \n{mat1_tr}\n') 

# matrix algebra: 
matprod = mat1.dot(mat2)  # same as mat1 @ mat2 
print(f'matprod: \n{matprod}\n') 
Output of Script 1.11: Numpy-Operations.py 
result1: 
[[5.45981500e+01 8.10308393e+03 2.98095799e+03] 
 [7.38905610e+00 4.03428793e+02 2.00855369e+01]] 

result2: 
[[ 5 14 10] 
 [ 8 12  3]] 

mat1_tr: 
[[4 2] 
 [9 6] 
 [8 3]] 

matprod: 
[[ 90 138  32] 
 [ 50  70  13]] 


1.2.4. Objects in pandas 


The module pandas⁶ builds on top of data types introduced in previous sections and allows us to 
work with something we will encounter almost every time we discuss an econometric application: 
a data frame. A data frame is a structure that collects several variables and can be thought of as a 
rectangular shape with the rows representing the observational units and the columns representing 
the variables. A data frame can contain variables of different data types (for example a numerical 
list, a one-dimensional ndarray, str and so on). Before you start working with pandas, make 
sure that it is installed.⁷ The standard alias of this module is pd, so when working with pandas, the 
first line of code always is: 


import pandas as pd 


The most important data type in pandas is DataFrame, which we will often simply refer to 
as "data frame". One strength of pandas is the existence of a whole set of operations that work 
on the index of a DataFrame. The index contains information on the observational unit, like the 
person answering a questionnaire or the date of a stock price you want to work with. Script 1.12 
(Pandas.py) shows the definition of a variable with data type DataFrame by providing a dict 
to the function pd.DataFrame. The definition of an index, in this example a date with monthly 
frequency (freq='M'), is also demonstrated. Accessing elements of a variable df referencing an 
object of data type DataFrame can be done in multiple ways: 

* Access columns/variables by name: 
  df['varname1'] or df[['varname1', 'varname2', ...]] 
* Access rows/observations by integer positions i to j: df[i:(j+1)] (also works with the 
  index names of df) 
* Access variables and observations by names: 
  df.loc['rowname', 'colname'] 
* Access variables and observations by row and column integer positions i and j: 
  df.iloc[i, j] 


⁶ For more information about the module, see McKinney (2011). 
⁷ The module pandas is part of the Anaconda distribution. 
If you define a DataFrame by a combination of several DataFrames, they are automatically 
matched by their indices. 
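
The matching by indices can be sketched as follows. The two small data frames and their labels are made up for illustration, and the join method is just one of several ways to combine them:

```python
import pandas as pd

# two hypothetical DataFrames whose rows are stored in a different order:
sales = pd.DataFrame({'sales': [30, 40, 35]}, index=['apr', 'may', 'jun'])
temp = pd.DataFrame({'temp': [24, 15, 19]}, index=['jun', 'apr', 'may'])

# combine them; rows are matched by index label, not by position:
combined = sales.join(temp)
print(f'combined: \n{combined}\n')

# the temperature for 'apr' is 15 although it is stored in the second row of temp:
temp_apr = combined.loc['apr', 'temp']
print(f'temp_apr: {temp_apr}\n')
```

Had the rows been matched by position instead, the April sales would wrongly be paired with the June temperature.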


Script 1.12: Pandas.py 
import numpy as np 
import pandas as pd 

# define a pandas DataFrame: 
icecream_sales = np.array([30, 40, 35, 130, 120, 60]) 
weather_coded = np.array([0, 1, 0, 1, 1, 0]) 
customers = np.array([2000, 2100, 1500, 8000, 7200, 2000]) 
df = pd.DataFrame({'icecream_sales': icecream_sales, 
                   'weather_coded': weather_coded, 
                   'customers': customers}) 

# define and assign an index (six ends of month starting in April, 2010) 
# (details on generating indices are given in Chapter 10; 
# recent pandas versions prefer the alias 'ME' over 'M' for month ends): 
ourIndex = pd.date_range(start='04/2010', freq='M', periods=6) 
df.set_index(ourIndex, inplace=True) 

# print the DataFrame 
print(f'df: \n{df}\n') 

# access columns by variable names: 
subset1 = df[['icecream_sales', 'customers']] 
print(f'subset1: \n{subset1}\n') 

# access second to fourth row: 
subset2 = df[1:4]  # same as df['2010-05-31':'2010-07-31'] 
print(f'subset2: \n{subset2}\n') 

# access rows and columns by index and variable names: 
subset3 = df.loc['2010-05-31', 'customers']  # same as df.iloc[1,2] 
print(f'subset3: \n{subset3}\n') 

# access rows and columns by index and variable integer positions: 
subset4 = df.iloc[1:4, 0:2] 
# same as df.loc['2010-05-31':'2010-07-31', ['icecream_sales','weather_coded']] 
print(f'subset4: \n{subset4}\n') 


Output of Script 1.12: Pandas.py 
df: 
            icecream_sales  weather_coded  customers 
2010-04-30              30              0       2000 
2010-05-31              40              1       2100 
2010-06-30              35              0       1500 
2010-07-31             130              1       8000 
2010-08-31             120              1       7200 
2010-09-30              60              0       2000 

subset1: 
            icecream_sales  customers 
2010-04-30              30       2000 
2010-05-31              40       2100 
2010-06-30              35       1500 
2010-07-31             130       8000 
2010-08-31             120       7200 
2010-09-30              60       2000 

subset2: 
            icecream_sales  weather_coded  customers 
2010-05-31              40              1       2100 
2010-06-30              35              0       1500 
2010-07-31             130              1       8000 

subset3: 
2100 

subset4: 
            icecream_sales  weather_coded 
2010-05-31              40              1 
2010-06-30              35              0 
2010-07-31             130              1 


Table 1.4. Important pandas Methods 

df.head()                     First 5 observations in df 
df.tail()                     Last 5 observations in df 
df.describe()                 Print descriptive statistics 
df.set_index(x)               Set the index of df as x 
df['x'] or df.x               Access x in df 
df.iloc[i, j]                 Access variables and observations in df by integer position 
df.loc[names_i, names_j]      Access variables and observations in df by names 
df['x'].shift(i)              Creates a variable of x shifted by i rows 
df['x'].diff(i)               Creates a variable that contains the ith difference of x 
df.groupby('x').function()    Apply a function to subgroups of df according to x 
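
The methods shift and diff from Table 1.4 are closely related, as the following sketch with a small made-up variable x shows:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 12, 15, 11]})

# value of x in the previous row (the first row has no predecessor):
df['x_lag1'] = df['x'].shift(1)

# first difference, i.e. x minus its value in the previous row:
df['x_diff1'] = df['x'].diff(1)
print(f'df: \n{df}\n')

# descriptive statistics of all three variables:
print(f'df.describe(): \n{df.describe()}\n')
```

The first row of both new variables is missing (NaN) because there is no previous observation, and x_diff1 equals x minus x_lag1 in all other rows.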


Many economic variables of interest have a qualitative rather than quantitative interpretation. 
They only take a finite set of values and the outcomes don't necessarily have a numerical meaning. 
Instead, they represent qualitative information. Examples include gender, academic major, grade, 
marital status, state, product type or brand. In some of these examples, the order of the outcomes 
has a natural interpretation (such as the grades), in others, it does not (such as the state). 

As a specific example, suppose we have asked our customers to rate our product on a scale with the 
values 0 (=“bad”), 1 (=“okay”), and 2 (=“good”). We have stored the answers of our ten respondents in 
terms of the numbers 0, 1, and 2 in a list. We could work directly with these numbers, but often, it 
is convenient to use the data type Categorical. One advantage is that we can attach labels 
to the outcomes. Script 1.13 (Pandas-Operations.py) extends a modified version of the previous 
example, in which the variable weather is coded numerically, and demonstrates how to assign 
meaningful labels. The example also includes some methods from Table 1.4, i.e. lagged variables 
and calling methods on subgroups of the data frame. The comments explain the effect of the 
respective action: 
Script 1.13: Pandas-Operations.py 
import numpy as np 
import pandas as pd 

# define a pandas DataFrame: 
icecream_sales = np.array([30, 40, 35, 130, 120, 60]) 
weather_coded = np.array([0, 1, 0, 1, 1, 0]) 
customers = np.array([2000, 2100, 1500, 8000, 7200, 2000]) 
df = pd.DataFrame({'icecream_sales': icecream_sales, 
                   'weather_coded': weather_coded, 
                   'customers': customers}) 

# define and assign an index (six ends of month starting in April, 2010) 
# (details on generating indices are given in Chapter 10): 
ourIndex = pd.date_range(start='04/2010', freq='M', periods=6) 
df.set_index(ourIndex, inplace=True) 

# include sales two months ago: 
df['icecream_sales_lag2'] = df['icecream_sales'].shift(2) 
print(f'df: \n{df}\n') 

# use a pandas.Categorical object to attach labels (0 = bad; 1 = good): 
df['weather'] = pd.Categorical.from_codes(codes=df['weather_coded'], 
                                          categories=['bad', 'good']) 
print(f'df: \n{df}\n') 

# mean sales for each weather category: 
group_means = df.groupby('weather').mean() 
print(f'group_means: \n{group_means}\n') 


Output of Script 1.13: Pandas-Operations.py 
df: 
            icecream_sales  weather_coded  customers  icecream_sales_lag2 
2010-04-30              30              0       2000                  NaN 
2010-05-31              40              1       2100                  NaN 
2010-06-30              35              0       1500                 30.0 
2010-07-31             130              1       8000                 40.0 
2010-08-31             120              1       7200                 35.0 
2010-09-30              60              0       2000                130.0 

df: 
            icecream_sales  weather_coded  ...  icecream_sales_lag2  weather 
2010-04-30              30              0  ...                  NaN      bad 
2010-05-31              40              1  ...                  NaN     good 
2010-06-30              35              0  ...                 30.0      bad 
2010-07-31             130              1  ...                 40.0     good 
2010-08-31             120              1  ...                 35.0     good 
2010-09-30              60              0  ...                130.0      bad 

[6 rows x 5 columns] 

group_means: 
         icecream_sales  weather_coded    customers  icecream_sales_lag2 
weather 
bad           41.666667            0.0  1833.333333                 80.0 
good          96.666667            1.0  5766.666667                 37.5 
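
The product rating example mentioned in the text can be handled in the same way. The following sketch uses ten made-up answers. The argument ordered=True is an optional addition of ours that records the natural order bad < okay < good:

```python
import pandas as pd

# hypothetical answers of ten respondents (0 = bad, 1 = okay, 2 = good):
ratings = [0, 2, 1, 2, 2, 1, 0, 2, 1, 2]

# attach labels to the codes:
ratings_cat = pd.Categorical.from_codes(codes=ratings,
                                        categories=['bad', 'okay', 'good'],
                                        ordered=True)
print(f'ratings_cat: \n{ratings_cat}\n')

# count how often each label occurs (in the order of the categories):
counts = pd.Series(ratings_cat).value_counts(sort=False)
print(f'counts: \n{counts}\n')
```

With these data, two respondents answered "bad", three "okay", and five "good".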
1.3. External Data 


In previous sections, we entered all of our data manually in the script files. This is a very untypical 
way of getting data into our computer and we will introduce more useful alternatives. These are 
based on the fact that many data sets are already stored somewhere else in data formats that Python 
can handle. 


1.3.1. Data Sets in the Examples 


We will reproduce many of the examples from Wooldridge (2019). The companion web site of 
the textbook provides the sample data sets in different formats. If you have an access code that 
came with the textbook, they can be downloaded free of charge. The Stata data sets are also made 
available online at the “Instructional Stata Datasets for econometrics” collection from Boston College, 
maintained by Christopher F. Baum.⁸ 

Fortunately, we do not have to download each data set manually and import them by the functions 
discussed in Section 1.3.2. Instead, we can use the external module wooldridge. It is not part of 
the Anaconda distribution and you have to install wooldridge as explained in Section 1.1.3. When 
working with wooldridge, the first line of code always is: 


import wooldridge as woo 


Script 1.14 (Wooldridge.py) demonstrates the first lines of a typical example in this book. As 
you see, we are dealing with a pandas data type, so all the methods from the previous section are 
applicable. 


Script 1.14: Wooldridge.py 


import wooldridge as woo 

# load data: 
wage1 = woo.dataWoo('wage1') 

# get type: 
print(f'type(wage1): \n{type(wage1)}\n') 

# get an overview: 
print(f'wage1.head(): \n{wage1.head()}\n') 


Output of Script 1.14: Wooldridge.py 
type(wage1): 
<class 'pandas.core.frame.DataFrame'> 

wage1.head(): 
   wage  educ  exper  tenure  ...  servocc     lwage  expersq  tenursq 
0  3.10    11      2       0  ...        0  1.131402        4        0 
1  3.24    12     22       2  ...        1  1.175573      484        4 
2  3.00    11      2       0  ...        0  1.098612        4        0 
3  6.00     8     44      28  ...        0  1.791759     1936      784 
4  5.30    12      7       2  ...        0  1.667707       49        4 

[5 rows x 24 columns] 


⁸ The address is https://econpapers.repec.org/paper/bocbocins/. 
Figure 1.5. Examples of Text Data Files 

(a) sales.txt                        (b) sales.csv 
year product1 product2 product3      2008,0,1,2 
2008 0 1 2                           2009,3,2,4 
2009 3 2 4                           2010,6,3,4 
2010 6 3 4                           2011,9,5,2 
2011 9 5 2                           2012,7,9,3 
2012 7 9 3                           2013,8,6,2 
2013 8 6 2 


1.3.2. Import and Export of Data Files 


Probably all software packages that handle data are capable of working with data stored as text files. 
This makes them a natural way to exchange data between different programs and users. Common file 
name extensions for such data files are RAW, CSV or TXT. Most statistics and spreadsheet programs 
come with their own file format to save and load data. While it is basically always possible to 
exchange data via text files, it might be convenient to be able to directly read or write data in the 
native format of some other software. 

Fortunately, the pandas toolbox provides the possibility for importing and exporting data from/to 
text files and many programs. This includes, for example, 

* Text file (TXT) with read_table (export with to_csv and a suitable separator), 
* CSV (CSV) with read_csv and to_csv, 
* MS Excel (XLS and XLSX) with read_excel and to_excel, 
* Stata (DTA) with read_stata and to_stata, 
* SAS (XPORT and SSD) with read_sas (import only). 

Figure 1.5 shows two flavors of a raw text file containing the same data. The file sales.txt 
contains a header with the variable names. In file sales.csv, the columns are separated by a 
comma. 

Text files for storing data come in different flavors, mainly differing in how the columns of 
the table are separated. The pandas commands read_table and read_csv provide possibilities 
for reading many flavors of text files which are then stored as a DataFrame. Script 1.15 
(Import-Export.py) demonstrates the import and export of the files shown in Figure 1.5. In 
this example, data files are stored in and exported to the folder data. 
Script 1.15: Import-Export.py 
import pandas as pd 

# import csv with pandas: 
df1 = pd.read_csv('data/sales.csv', delimiter=',', header=None, 
                  names=['year', 'product1', 'product2', 'product3']) 
print(f'df1: \n{df1}\n') 

# import txt with pandas: 
df2 = pd.read_table('data/sales.txt', delimiter=' ') 
print(f'df2: \n{df2}\n') 

# add a row to df1 (DataFrame.append was removed in pandas 2.0, so we use pd.concat): 
new_row = pd.DataFrame([{'year': 2014, 'product1': 10, 'product2': 8, 'product3': 2}]) 
df3 = pd.concat([df1, new_row], ignore_index=True) 
print(f'df3: \n{df3}\n') 

# export with pandas: 
df3.to_csv('data/sales2.csv') 


Output of Script 1.15: Import-Export.py 
df1: 
   year  product1  product2  product3 
0  2008         0         1         2 
1  2009         3         2         4 
2  2010         6         3         4 
3  2011         9         5         2 
4  2012         7         9         3 
5  2013         8         6         2 

df2: 
   year  product1  product2  product3 
0  2008         0         1         2 
1  2009         3         2         4 
2  2010         6         3         4 
3  2011         9         5         2 
4  2012         7         9         3 
5  2013         8         6         2 

df3: 
   year  product1  product2  product3 
0  2008         0         1         2 
1  2009         3         2         4 
2  2010         6         3         4 
3  2011         9         5         2 
4  2012         7         9         3 
5  2013         8         6         2 
6  2014        10         8         2 
The command read_csv includes many optional arguments that can be added. Many of these 
arguments are detected automatically by pandas, but you can also specify them explicitly. The most 
important arguments are: 

* header: Integer specifying the row that includes the variable names. Can also be None. 
* sep: Often columns are separated by a comma, i.e. sep=',' (default). Instead, an arbitrary 
  other character can be given. sep=';' might be another relevant example of a separator. 
* names: If no header is specified, you can provide a list of variable names. 
* index_col: The values in column index_col are used as an index. 
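
The arguments listed above can be tried out without any file on disk by reading from a string. In the following sketch, the text content and the variable names are made up, and io.StringIO merely stands in for a file name:

```python
import io
import pandas as pd

# hypothetical file content: no header row, columns separated by ';'
raw_text = '2008;0;1\n2009;3;2\n2010;6;3\n'

# sep gives the separator, header=None states that there is no header row,
# names provides the variable names and index_col selects the index column:
df = pd.read_csv(io.StringIO(raw_text), sep=';', header=None,
                 names=['year', 'product1', 'product2'],
                 index_col='year')
print(f'df: \n{df}\n')
```

Since year is used as the index, the resulting DataFrame has only the two columns product1 and product2.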


1.3.3. Data from other Sources 


The last part of this section deals with importing data from other sources than local files on your computer. 
We will use an extension of pandas called pandas_datareader, which makes it straightforward 
to query online databases. It is not part of the Anaconda distribution and you have to install 
pandas_datareader as explained in Section 1.1.3. Script 1.16 (Import-StockData.py) demonstrates 
the workflow of importing stock data of Ford Motor Company. All you have to do is specify 
start and end date and the data source, which is Yahoo Finance in this case. 


Script 1.16: Import-StockData.py 
import pandas_datareader as pdr 

# download data for 'F' (= Ford Motor Company) and define start and end: 
tickers = ['F'] 
start_date = '2014-01-01' 
end_date = '2015-12-31' 

# use pandas_datareader for the import: 
F_data = pdr.data.DataReader(tickers, 'yahoo', start_date, end_date) 

# look at imported data: 
print(f'F_data.head(): \n{F_data.head()}\n') 
print(f'F_data.tail(): \n{F_data.tail()}\n') 


Output of Script 1.16: Import-StockData.py 
F_data.head(): 
Attributes  Adj Close  Close   High    Low   Open      Volume 
Symbols             F      F      F      F      F           F 
Date 
2014-01-02  11.131250  15.44  15.45  15.28  15.42  31528500.0 
2014-01-03  11.181718  15.51  15.64  15.30  15.52  46122300.0 
2014-01-06  11.232182  15.58  15.76  15.52  15.72  42657600.0 
2014-01-07  11.087993  15.38  15.74  15.35  15.73  54476300.0 
2014-01-08  11.203343  15.54  15.71  15.51  15.60  48448300.0 

F_data.tail(): 
Attributes  Adj Close  Close   High    Low   Open      Volume 
Symbols             F      F      F      F      F           F 
Date 
2015-12-24  11.082371  14.31  14.37  14.25  14.35   9000100.0 
2015-12-28  10.981693  14.18  14.34  14.16  14.28  13697500.0 
2015-12-29  11.020413  14.23  14.30  14.15  14.28  18867800.0 
2015-12-30  10.973947  14.17  14.26  14.12  14.23  13800300.0 
2015-12-31  10.911990  14.09  14.16  14.04  14.14  19881000.0 
1.4. Base Graphics with matplotlib 


The module matplotlib is a popular and versatile tool for producing all kinds of graphs in Python. 
In this section, we discuss the overall base approach for producing graphs and the most important 
general types of graphs. We will only scratch the surface of matplotlib, but you will see most 
of the graph producing commands relevant for this book. For more information, see Hunter (2007). 
Some specific graphs used for descriptive statistics will be introduced in Section 1.5. 

Before you start producing your own graphs, make sure that you use the Anaconda distribution 
or install matplotlib as explained in Section 1.1.3. When working with matplotlib, the first line 
of code always is: 


import matplotlib.pyplot as plt 


1.4.1. Basic Graphs 


One very general type is a two-way graph with an abscissa and an ordinate that typically represent 
two variables like X and Y. 

If we have data in two lists x and y, we can easily generate scatter plots, line plots or similar 
two-way graphs. The command plot is capable of these types of graphs and we will see some 
of the more specialized uses later on. Script 1.17 (Graphs-Basics.py) generates Figure 1.6(a) 
and demonstrates the minimum amount of code to produce a black line plot with all other options 
on default. Graphs are displayed in a separate Python window.⁹ The last two lines export the 
created plot as a PDF file to the folder PyGraphs and reset the plot to create a completely new 
one. If the folder PyGraphs does not exist yet, you must create it first to execute Script 1.17 
(Graphs-Basics.py) without error. 
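
If you prefer to create the folder from within Python, a sketch along the following lines could be used. The data, the file name Example-Folder.pdf and the choice of the non-interactive Agg backend are our own assumptions and not part of the scripts of this book:

```python
import os
import matplotlib
matplotlib.use('Agg')  # optional: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# create the folder PyGraphs if it does not exist yet:
os.makedirs('PyGraphs', exist_ok=True)

# hypothetical data:
x = [1, 2, 3]
y = [2, 4, 3]

# plot and save:
plt.plot(x, y, color='black')
plt.savefig('PyGraphs/Example-Folder.pdf')
plt.close()
```

The argument exist_ok=True makes sure that no error is raised if the folder is already there, so the lines can be executed repeatedly.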


Script 1.17: Graphs-Basics.py 
import matplotlib.pyplot as plt 

# create data: 
x = [..., 9] 
y = [..., 8] 

# plot and save: 
plt.plot(x, y, color='black') 
plt.savefig('PyGraphs/Graphs-Basics-a.pdf') 
plt.close() 


Two important arguments of the plot command are linestyle and marker. The argument 
linestyle takes the values '-' (the default), '--', ':', and many more. The argument marker 
is empty by default, and can take 'o', 'v', and many more. Some resulting plots are shown in 
Figure 1.6. The code is shown in the appendix in Script 1.18 (Graphs-Basics2.py). 
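
As a small sketch of how these two arguments are passed (with made-up data, and independent of the appendix script), consider a dashed line with circle markers. The Agg backend is again only an assumption that allows running the code without a graphics window:

```python
import matplotlib
matplotlib.use('Agg')  # optional: non-interactive backend
import matplotlib.pyplot as plt

# hypothetical data:
x = [1, 2, 3, 4]
y = [1, 4, 2, 3]

# dashed line with circle markers; plot returns a list of line objects:
line, = plt.plot(x, y, linestyle='--', marker='o', color='black')

# the chosen options can be read back from the line object:
print(f'linestyle: {line.get_linestyle()}\n')
print(f'marker: {line.get_marker()}\n')
plt.close()
```

Setting linestyle='' together with a marker would suppress the line and leave only the points, which gives a simple scatter plot.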

The plot command can be used to create a function plot, i.e. function values y = f(x) are plotted 
against x. To plot a smooth function, the first step is to generate a fine grid of x values. In Script 1.19 
(Graphs-Functions.py) we choose linspace from numpy and control the number of x values 
with num.¹⁰ The following plotting of the function works exactly as in the previous example. We 
choose the quadratic function plotted in Figure 1.7(a) and the standard normal density (see Section 
1.6) in Figure 1.7(b). 

⁹ If creating your graph requires the execution of multiple lines of code, make sure to execute them all at once and not line 
by line. Otherwise you might get multiple plots instead of one. 
¹⁰ The module scipy will be introduced in Section 1.6. 
Figure 1.6. Examples of Point and Line Plots using plot(x, y) 
(a) see Script 1.17 (Graphs-Basics.py)    (b) linestyle='--' 

Figure 1.7. Examples of Function Plots using plot 
(a) x ** 2    (b) stats.norm.pdf(x) 


Script 1.19: Graphs-Functions.py 
import scipy.stats as stats 
import numpy as np 
import matplotlib.pyplot as plt 

# support of quadratic function 
# (creates an array with 100 equispaced elements from -3 to 2): 
x1 = np.linspace(-3, 2, num=100) 
# function values for all these values: 
y1 = x1 ** 2 

# plot quadratic function: 
plt.plot(x1, y1, linestyle='-', color='black') 
plt.savefig('PyGraphs/Graphs-Functions-a.pdf') 
plt.close() 

# same for normal density: 
x2 = np.linspace(-4, 4, num=100) 
y2 = stats.norm.pdf(x2) 

# plot normal density: 
plt.plot(x2, y2, linestyle='-', color='black') 
plt.savefig('PyGraphs/Graphs-Functions-b.pdf') 


1.4.2. Customizing Graphs with Options 


As already demonstrated in the examples, these plots can be adjusted very flexibly. A few examples: 
* The width of the lines can be changed using the argument linewidth (default in current 
  matplotlib versions: linewidth=1.5). 
* The size of the marker symbols can be changed using the argument markersize (default in 
  current matplotlib versions: markersize=6). 
* The color of the lines and symbols can be changed using the argument color. It can be 
  specified in several ways: 
  - By name: 'blue', 'green', 'red', 'cyan', 'magenta', 'yellow', 'black', 
    'white'. 
  - Gray scale by a string encoding a number between 0 (black) and 1 (white), for example 
    plt.plot(x1, y1, linestyle='-', color='0.3'). 
  - By RGBA values provided by (r, g, b, a) with each letter representing a number between 0 
    and 1, for example 
    plt.plot(x1, y1, linestyle='-', color=(0.9, 0.2, 0.1, 0.3)).¹¹ This is 
    useful for fine-tuning colors. 


You can also add more elements to change the appearance of your plot: 


* A title can be added using title('My Title'). 
* The horizontal and vertical axis can be labeled using xlabel('My x axis label') and 
  ylabel('My y axis label'). 
* The limits of the horizontal and the vertical axis can be chosen using xlim(min, max) and 
  ylim(min, max), respectively. 


For an example, see Script 1.20 (Graphs-BuildingBlocks.py) and Figure 1.8. 
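
Independently of Script 1.20, the options discussed in this section can be combined as in the following short sketch. The data and labels are made up, and the Agg backend is only an assumption that allows running the code without a graphics window:

```python
import matplotlib
matplotlib.use('Agg')  # optional: non-interactive backend
import matplotlib.pyplot as plt

# hypothetical data:
x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

# thicker line in a semi-transparent red given as RGBA values:
plt.plot(x, y, linestyle='-', linewidth=2, color=(0.9, 0.2, 0.1, 0.3))

# title, axis labels and axis limits:
plt.title('My Title')
plt.xlabel('My x axis label')
plt.ylabel('My y axis label')
plt.xlim(0, 3)
plt.ylim(0, 10)

# read the settings back from the current axes object:
ax = plt.gca()
x_limits = ax.get_xlim()
y_limits = ax.get_ylim()
print(f'x_limits: {x_limits}\n')
print(f'y_limits: {y_limits}\n')
plt.close()
```

Reading the limits back with get_xlim and get_ylim is not required for plotting. It is included here only to make the effect of xlim and ylim visible without looking at the graph.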


1.4.3. Overlaying Several Plots 


Often, we want to plot more than one set of variables or multiple graphical elements. This is an easy 
task, because each plot is added to the previous one by default.¹² 

Script 1.20 (Graphs-BuildingBlocks.py) shows an example that also demonstrates some of 
the options from the previous paragraph. Its result is shown in Figure 1.8.¹³ 


Script 1.20: Graphs-BuildingBlocks.py 
import scipy.stats as stats 
import numpy as np 
import matplotlib.pyplot as plt 

# support for all normal densities: 
x = np.linspace(-4, 4, num=100) 

# get different density evaluations: 
y1 = stats.norm.pdf(x, 0, 1) 
y2 = stats.norm.pdf(x, 1, 0.5) 
y3 = stats.norm.pdf(x, 0, 2) 

# plot (raw strings r'...' keep the backslashes of the TeX commands intact): 
plt.plot(x, y1, linestyle='-', color='black', label='standard normal') 
plt.plot(x, y2, linestyle='--', color='0.3', label='mu = 1, sigma = 0.5') 
plt.plot(x, y3, linestyle=':', color='0.6', label=r'$\mu = 0$, $\sigma = 2$') 
plt.xlim(-3, 4) 
plt.title('Normal Densities') 
plt.ylabel(r'$\phi(x)$') 
plt.xlabel('x') 
plt.legend() 
plt.savefig('PyGraphs/Graphs-BuildingBlocks.pdf') 


¹¹ The RGB color model defines colors as a mix of the components red, green, and blue. a is optional and controls for 
transparency. 

¹² To avoid this and reset your graph, use the command close after completing a graph. 

¹³ The module scipy will be introduced in Section 1.6. 
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Figure 1.8. Overlayed Plots 
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In this example, you can also see some useful commands for adding elements to an existing graph. 
Here are some (more) examples: 
* axhline(y-value) adds a horizontal line at y. 
* axvline(x-value) adds a vertical line at x. 


* legend() adds a legend based on the string provided in each graphical element in label. 
matplotlib finds the best position. 
In the legend, but also everywhere within a graph (title, axis labels, ...) we can also use Greek letters, 


equations, and similar features in a relatively straightforward way. This is done using respective TEX 
commands as demonstrated in Script 1.20 (Graphs-BuildingBlocks.py)and Figure 1.8. 


1.4.4. Exporting to a File 


By default, a graph generated in one of the ways we discussed above will be displayed in its own 
window. Python offers the possibility to export the generated plots automatically using specific 
commands. 


Among the different graphics formats, the PNG (Portable Network Graphics) format is very useful
for saving plots to use them in a word processor and similar programs. For LaTeX users, PS, EPS and
SVG are available and PDF is very useful. You have already seen the export syntax in many examples:


plt.savefig('filepath/filename.format')

Figure 1.9. Examples of Exported Plots
(a) plt.figure(figsize=(4, 6))    (b) plt.figure(figsize=(6, 4))

To set the width and height of your graph in inches, you start your code with
plt.figure(figsize=(width, height)). Script 1.21 (Graphs-Export.py) and Figure
1.9 demonstrate the complete procedure.
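
Before looking at the full script, the two ingredients (figure size and export format) can be checked in isolation. The following sketch is not one of the book's scripts. It makes two assumptions for convenience: it selects the non-interactive Agg backend so that no window is opened, and it writes the PNG to an in-memory buffer instead of a file path so that nothing is left on disk.

```python
import io
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, no window
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-4, 4, num=100)

# set width = 6 and height = 4 inches before plotting:
fig = plt.figure(figsize=(6, 4))
plt.plot(x, x ** 2, linestyle='-', color='black')

# read the size back from the figure object:
size = tuple(float(v) for v in fig.get_size_inches())

# export as PNG to an in-memory buffer instead of a file:
buffer = io.BytesIO()
plt.savefig(buffer, format='png')
plt.close()

png_bytes = buffer.getvalue()
print(f'size: {size}\n')
print(f'first bytes of the export: {png_bytes[:4]}\n')
```

With a file name instead of a buffer, savefig infers the format from the file extension, which is why the scripts in this section do not need the format argument.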


Script 1.21: Graphs-Export.py
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

# support for all normal densities:
x = np.linspace(-4, 4, num=100)

# get different density evaluations:
y1 = stats.norm.pdf(x, 0, 1)
y2 = stats.norm.pdf(x, 0, 3)

# plot (a):
plt.figure(figsize=(4, 6))
plt.plot(x, y1, linestyle='-', color='black')
plt.plot(x, y2, linestyle='--', color='0.3')
plt.savefig('PyGraphs/Graphs-Export-a.pdf')
plt.close()

# plot (b):
plt.figure(figsize=(6, 4))
plt.plot(x, y1, linestyle='-', color='black')
plt.plot(x, y2, linestyle='--', color='0.3')
plt.savefig('PyGraphs/Graphs-Export-b.png')


1.5. Descriptive Statistics


The Python modules pandas, numpy and matplotlib offer many commands for descriptive statistics.
In this section, we cover the most important ones for our purpose.


1.5.1. Discrete Distributions: Frequencies and Contingency Tables 


Suppose we have a sample of the random variables X and Y stored in numpy or pandas data types
x and y, respectively. For discrete variables, the most fundamental statistics are the frequencies
of outcomes. The numpy command unique(x, return_counts=True) or pandas command
x.value_counts() returns such a table of counts. If we are interested in the contingency table, i.e.
the counts of each combination of outcomes for variables x and y, we provide both variables to the
crosstab function in pandas. For getting the sample shares instead of the counts, we can change the
function's argument normalize:

* The overall sample share: crosstab(x, y, normalize='all')
* The share within x values (row percentages): crosstab(x, y, normalize='index')
* The share within y values (column percentages): crosstab(x, y, normalize='columns')

As an example, we look at the data set affairs in Script 1.22 (Descr-Tables.py). We demonstrate
the workings of the numpy and pandas commands with two variables:

* kids = 1 if the respondent has at least one child
* ratemarr = Rating of the own marriage (1=very unhappy, ..., 5=very happy)


Script 1.22: Descr-Tables.py
import wooldridge as woo
import numpy as np
import pandas as pd

affairs = woo.dataWoo('affairs')

# adjust codings to [0-4] (Categoricals require a start from 0):
affairs['ratemarr'] = affairs['ratemarr'] - 1

# use a pandas.Categorical object to attach labels for "haskids":
affairs['haskids'] = pd.Categorical.from_codes(affairs['kids'],
                                               categories=['no', 'yes'])
# ... and "marriage" (for example: 0 = 'very unhappy', 1 = 'unhappy',...):
mlab = ['very unhappy', 'unhappy', 'average', 'happy', 'very happy']
affairs['marriage'] = pd.Categorical.from_codes(affairs['ratemarr'],
                                                categories=mlab)

# frequency table in numpy (alphabetical order of elements):
ft_np = np.unique(affairs['marriage'], return_counts=True)
unique_elem_np = ft_np[0]
counts_np = ft_np[1]
print(f'unique_elem_np: \n{unique_elem_np}\n')
print(f'counts_np: \n{counts_np}\n')

# frequency table in pandas:
ft_pd = affairs['marriage'].value_counts()
print(f'ft_pd: \n{ft_pd}\n')

# frequency table with groupby:
ft_pd2 = affairs['marriage'].groupby(affairs['haskids']).value_counts()
print(f'ft_pd2: \n{ft_pd2}\n')

# contingency table in pandas (margins=True adds the row and column totals):
ct_all_abs = pd.crosstab(affairs['marriage'], affairs['haskids'], margins=True)
print(f'ct_all_abs: \n{ct_all_abs}\n')
ct_all_rel = pd.crosstab(affairs['marriage'], affairs['haskids'], normalize='all')
print(f'ct_all_rel: \n{ct_all_rel}\n')

# share within "marriage" (i.e. within a row):
ct_row = pd.crosstab(affairs['marriage'], affairs['haskids'], normalize='index')
print(f'ct_row: \n{ct_row}\n')

# share within "haskids" (i.e. within a column):
ct_col = pd.crosstab(affairs['marriage'], affairs['haskids'], normalize='columns')
print(f'ct_col: \n{ct_col}\n')


Output of Script 1.22: Descr-Tables.py
unique_elem_np:
['average' 'happy' 'unhappy' 'very happy' 'very unhappy']

counts_np:
[ 93 194  66 232  16]

ft_pd:
very happy      232
happy           194
average          93
unhappy          66
very unhappy     16
Name: marriage, dtype: int64

ft_pd2:
haskids  marriage
no       very happy       96
         happy            40
         average          24
         unhappy           8
         very unhappy      3
yes      happy           154
         very happy      136
         average          69
         unhappy          58
         very unhappy     13
Name: marriage, dtype: int64

ct_all_abs:
haskids        no  yes  All
marriage
very unhappy    3   13   16
unhappy         8   58   66
average        24   69   93
happy          40  154  194
very happy     96  136  232
All           171  430  601

ct_all_rel:
haskids             no       yes
marriage
very unhappy  0.004992  0.021631
unhappy       0.013311  0.096506
average       0.039933  0.114809
happy         0.066556  0.256240
very happy    0.159734  0.226290

ct_row:
haskids             no       yes
marriage
very unhappy  0.187500  0.812500
unhappy       0.121212  0.878788
average       0.258065  0.741935
happy         0.206186  0.793814
very happy    0.413793  0.586207

ct_col:
haskids             no       yes
marriage
very unhappy  0.017544  0.030233
unhappy       0.046784  0.134884
average       0.140351  0.160465
happy         0.233918  0.358140
very happy    0.561404  0.316279


In the Python script, we first generate Categorical versions of the two variables of interest 
from the coded values provided by the data set affairs. In this way, we can generate tables with 
meaningful labels instead of numbers for the outcomes, see Section 1.2.4. Then different tables are 
produced. Of the 601 respondents, 430 (=71.5%) have children. Overall, 16 respondents report to 
be very unhappy with their marriage and 232 respondents are very happy. In the contingency table 
with counts, we see for example that 136 respondents are very happy and have kids. 

The table reporting shares within the rows (ct_row) tells us that for example 81.25% of very
unhappy individuals have children and only 58.6% of very happy respondents have kids. The last 
table reports the distribution of marriage ratings separately for people with and without kids: 56.1% 
of the respondents without kids are very happy, whereas only 31.6% of those with kids report to 
be very happy with their marriage. Before drawing any conclusions for your own family planning, 
please keep on studying econometrics at least until you fully appreciate the difference between 
correlation and causation! 
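
The three normalize options are easiest to tell apart on a sample that is small enough to check by hand. The following sketch uses five made-up observations (not the affairs data) and verifies which sums equal one in each case.

```python
import pandas as pd

# small made-up sample (not the affairs data):
x = pd.Series(['a', 'a', 'b', 'b', 'b'], name='x')
y = pd.Series(['u', 'v', 'u', 'u', 'v'], name='y')

# counts and the three kinds of shares:
ct_abs = pd.crosstab(x, y)
ct_all = pd.crosstab(x, y, normalize='all')
ct_row = pd.crosstab(x, y, normalize='index')
ct_col = pd.crosstab(x, y, normalize='columns')

print(f'ct_abs: \n{ct_abs}\n')
print(f'sum of all cells in ct_all: {ct_all.values.sum()}\n')
print(f'row sums of ct_row: \n{ct_row.sum(axis=1)}\n')
print(f'column sums of ct_col: \n{ct_col.sum(axis=0)}\n')
```

With normalize='all' the whole table sums to one, with normalize='index' each row does, and with normalize='columns' each column does.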

There are several ways to graphically depict the information in these tables. Script 1.23 
(Descr-Figures.py) demonstrates the creation of basic pie and bar charts using the commands 
pie and bar, respectively. These figures can of course be tweaked in many ways, see the help pages 
and the general discussions of graphics in Section 1.4. The best way to explore the options is to 
tinker with the specification and observe the results.

Script 1.23: Descr-Figures.py
import wooldridge as woo
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

affairs = woo.dataWoo('affairs')

# attach labels (see previous script):
affairs['ratemarr'] = affairs['ratemarr'] - 1
affairs['haskids'] = pd.Categorical.from_codes(affairs['kids'],
                                               categories=['no', 'yes'])
mlab = ['very unhappy', 'unhappy', 'average', 'happy', 'very happy']
affairs['marriage'] = pd.Categorical.from_codes(affairs['ratemarr'],
                                                categories=mlab)

# counts for all graphs (value_counts() sorts by frequency, so we
# reindex to match the order of the labels in mlab):
counts = affairs['marriage'].value_counts().reindex(mlab)
counts_bykids = affairs['marriage'].groupby(affairs['haskids']).value_counts()
counts_yes = counts_bykids['yes'].reindex(mlab)
counts_no = counts_bykids['no'].reindex(mlab)

# pie chart (a):
grey_colors = ['0.3', '0.4', '0.5', '0.6', '0.7']
plt.pie(counts, labels=mlab, colors=grey_colors)
plt.savefig('PyGraphs/Descr-Pie.pdf')
plt.close()

# horizontal bar chart (b):
y_pos = [0, 1, 2, 3, 4]  # the y locations for the bars
plt.barh(y_pos, counts, color='0.6')
plt.yticks(y_pos, mlab, rotation=60)  # add and adjust labeling
plt.savefig('PyGraphs/Descr-Bar1.pdf')
plt.close()

# stacked bar plot (c):
x_pos = [0, 1, 2, 3, 4]  # the x locations for the bars
plt.bar(x_pos, counts_yes, width=0.4, color='0.6', label='Yes')
# with 'bottom=counts_yes' bars are added on top of previous ones
plt.bar(x_pos, counts_no, width=0.4, bottom=counts_yes, color='0.3', label='No')
plt.ylabel('Counts')
plt.xticks(x_pos, mlab)  # add labels on x axis
plt.legend()
plt.savefig('PyGraphs/Descr-Bar2.pdf')
plt.close()

# grouped bar plot (d)
# add left bars first and move bars to the left:
x_pos_leftbar = [-0.2, 0.8, 1.8, 2.8, 3.8]
plt.bar(x_pos_leftbar, counts_yes, width=0.4, color='0.6', label='Yes')
# add right bars and move bars to the right:
x_pos_rightbar = [0.2, 1.2, 2.2, 3.2, 4.2]
plt.bar(x_pos_rightbar, counts_no, width=0.4, color='0.3', label='No')
plt.ylabel('Counts')
plt.xticks(x_pos, mlab)
plt.legend()
plt.savefig('PyGraphs/Descr-Bar3.pdf')

Figure 1.10. Pie and Bar Plots
(a) Pie chart    (b) Horizontal bar chart
(c) Stacked bar plot    (d) Grouped bar plot


1.5.2. Continuous Distributions: Histogram and Density


For continuous variables, every observation has a distinct value. In practice, variables which have 
many (but not infinitely many) different values can be treated in the same way. Since each value 
appears only once (or a very few times) in the data, frequency tables or bar charts are not useful. 
Instead, the values can be grouped into intervals. The frequency of values within these intervals can 
then be tabulated or depicted in a histogram. 

In the Python module matplotlib, the function hist(x, options) assigns observations to
intervals which can be manually set or automatically chosen and creates a histogram which plots
values of x against the count or density within the corresponding bin. The most relevant options are

* bins=...: Set the interval boundaries:
  - no bins specified: let Python choose number and position.
  - bins=n for a scalar n: select the number of bins, but let Python choose the position.
  - bins=v for a list v: explicitly set the boundaries.
* density=True: do not use the count but the density on the vertical axis.

Let's look at the data set CEOSAL1 which is described and used in Wooldridge (2019, Example
2.3). It contains information on the salary of CEOs and other information. We will try to depict
the distribution of the return on equity (ROE), measured in percent. Script 1.24 (Histogram.py)
generates the graphs of Figure 1.11. In Sub-figure (b), the breaks are manually chosen and not
equally spaced. Setting density=True gives the densities on the vertical axis: The sample share of
observations within a bin is therefore reflected by the area of the respective rectangle, not the height.
Script 1.24: Histogram.py
import wooldridge as woo
import matplotlib.pyplot as plt

ceosal1 = woo.dataWoo('ceosal1')

# extract roe:
roe = ceosal1['roe']

# subfigure a (histogram with counts):
plt.hist(roe, color='grey')
plt.ylabel('Counts')
plt.xlabel('roe')
plt.savefig('PyGraphs/Histogram1.pdf')
plt.close()

# subfigure b (histogram with density and explicit breaks):
breaks = [0, 5, 10, 20, 30, 60]
plt.hist(roe, color='grey', bins=breaks, density=True)
plt.ylabel('density')
plt.xlabel('roe')
plt.savefig('PyGraphs/Histogram2.pdf')
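
The statement that the area, not the height, of a rectangle reflects the sample share can be verified numerically. The sketch below uses made-up data and the numpy function histogram as a stand-in for plt.hist, since it returns the counts and densities as arrays instead of drawing them.

```python
import numpy as np

# made-up data and unequally spaced breaks:
x = np.array([1, 2, 3, 7, 8, 12, 15, 25, 40, 55])
breaks = [0, 5, 10, 20, 30, 60]

# counts and densities for the same bins:
counts, edges = np.histogram(x, bins=breaks)
density, _ = np.histogram(x, bins=breaks, density=True)

# area of each rectangle = density * bin width:
widths = np.diff(edges)
areas = density * widths
shares = counts / counts.sum()

print(f'counts: {counts}\n')
print(f'shares: {shares}\n')
print(f'areas: {areas}\n')
```

The areas coincide with the sample shares and add up to one, whereas the heights (the densities themselves) depend on the width of each bin.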


A kernel density plot can be thought of as a more sophisticated version of a histogram. We cannot 
go into detail here, but an intuitive (and oversimplifying) way to think about it is this: We could 
create a histogram bin of a certain width, centered at an arbitrary point of x. We will do this for 
many points and plot these x values against the resulting densities. Here, we will not use this plot 
as an estimator of a population distribution but rather as a pretty alternative to a histogram for the 
descriptive characterization of the sample distribution. For details, see for example Silverman (1986). 
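
The oversimplified intuition above can be written down in a few lines. This is only a sketch of the "moving histogram bin" idea with made-up data and an arbitrarily chosen bin width h. It is not the algorithm used by statsmodels, which by default works with a smooth kernel and an automatically chosen bandwidth.

```python
import numpy as np

# made-up sample:
x = np.array([2.0, 3.5, 4.0, 4.2, 5.1, 7.3, 8.0, 12.5])
h = 3.0  # bin width (chosen arbitrarily for this illustration)

# grid of points at which the bin is centered:
grid = np.linspace(x.min() - h, x.max() + h, num=500)

# share of observations within [g - h/2, g + h/2], divided by the bin width:
dens = np.array([np.mean(np.abs(x - g) <= h / 2) / h for g in grid])

# the area under the resulting curve should be close to 1:
area = np.sum(dens[:-1] * np.diff(grid))
print(f'area: {area}\n')
```

Plotting grid against dens with plt.plot would give a step-like curve. Replacing the hard in-or-out rule by smoothly declining weights is what makes actual kernel density plots look smooth.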

In Python, generating a kernel density plot is straightforward with the module statsmodels:
nonparametric.KDEUnivariate(x).fit() will automatically choose appropriate parameters
of the algorithm given the data and often produce a useful result.14 Of course, these parameters (like
the kernel and bandwidth for those who know what that is) can be set manually.

Figure 1.11. Histograms

Script 1.25 (KDensity.py) demonstrates how the result of the density estimation can be plotted
with matplotlib and generates the graphs of Figure 1.12. In Sub-figure (b), a histogram is
overlayed with a kernel density plot.


Script 1.25: KDensity.py
import wooldridge as woo
import statsmodels.api as sm
import matplotlib.pyplot as plt

ceosal1 = woo.dataWoo('ceosal1')

# extract roe:
roe = ceosal1['roe']

# estimate kernel density:
kde = sm.nonparametric.KDEUnivariate(roe)
kde.fit()

# subfigure a (kernel density):
plt.plot(kde.support, kde.density, color='black', linewidth=2)
plt.ylabel('density')
plt.xlabel('roe')
plt.savefig('PyGraphs/Density1.pdf')
plt.close()  # reset the graph so that subfigure b starts from scratch

# subfigure b (kernel density with overlayed histogram):
plt.hist(roe, color='grey', density=True)
plt.plot(kde.support, kde.density, color='black', linewidth=2)
plt.ylabel('density')
plt.xlabel('roe')
plt.savefig('PyGraphs/Density2.pdf')


14 The module statsmodels will be introduced in Chapter 2.


Figure 1.12. Kernel Density Plots
(a) Kernel density    (b) Kernel density with histogram


1.5.3. Empirical Cumulative Distribution Function (ECDF) 


The ECDF is a graph of all values x of a variable against the share of observations with a value less
than or equal to x. A straightforward way to plot the ECDF for our ROE variable is shown in Script
1.26 (Descr-ECDF.py) and Figure 1.13. A more automated approach is the use of the statsmodels
function distributions.empirical_distribution.ECDF(x), which would give the same result.

For example, the value of the ECDF for point roe=15.5 is 0.5. Half of the sample is less than or equal
to a ROE of 15.5%. In other words: the median ROE is 15.5%.


Script 1.26: Descr-ECDF.py
import wooldridge as woo
import numpy as np
import matplotlib.pyplot as plt

ceosal1 = woo.dataWoo('ceosal1')

# extract roe:
roe = ceosal1['roe']

# calculate ECDF:
x = np.sort(roe)
n = x.size
y = np.arange(1, n + 1) / n  # generates cumulative shares of observations

# plot a step function:
plt.step(x, y, linestyle='-', color='black')
plt.xlabel('roe')
plt.savefig('PyGraphs/ecdf.pdf')
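
To see what the sorted values and cumulative shares mean, it helps to evaluate the ECDF by hand for a few points. The sketch below uses eight made-up numbers instead of the ROE data. The helper function ecdf_at is introduced here only for illustration.

```python
import numpy as np

# made-up sample:
x_raw = np.array([3.0, 1.0, 4.0, 1.5, 9.0, 2.5, 6.0, 5.0])

# same steps as for the ROE variable:
x = np.sort(x_raw)
n = x.size
y = np.arange(1, n + 1) / n


def ecdf_at(value):
    """Share of observations less than or equal to value."""
    return np.mean(x_raw <= value)


print(f'x: {x}\n')
print(f'y: {y}\n')
print(f'ecdf_at(3.0): {ecdf_at(3.0)}\n')
```

Four of the eight observations are less than or equal to 3, so the ECDF at 3 is 0.5. This is the same value as the entry of y that belongs to the sorted observation 3.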


Figure 1.13. Empirical CDF


Table 1.5. numpy Functions for Descriptive Statistics

mean(x)            Sample average $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
median(x)          Sample median
var(x, ddof=1)     Sample variance $s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
std(x, ddof=1)     Sample standard deviation $s_x = \sqrt{s_x^2}$
cov(x, y)          Sample covariance $c_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
corrcoef(x, y)     Sample correlation $r_{xy} = \frac{c_{xy}}{s_x \cdot s_y}$
quantile(x, q)     q quantile = 100·q percentile, e.g. quantile(x, 0.5) = sample median


1.5.4. Fundamental Statistics


The functions for calculating the most important descriptive statistics with numpy are listed in
Table 1.5. Script 1.27 (Descr-Stats.py) demonstrates this using the CEOSAL1 data set we already
introduced in Section 1.5.2.


Script 1.27: Descr-Stats.py
import wooldridge as woo
import numpy as np

ceosal1 = woo.dataWoo('ceosal1')

# extract roe and salary:
roe = ceosal1['roe']
salary = ceosal1['salary']

# sample average (of salary):
roe_mean = np.mean(salary)
print(f'roe_mean: {roe_mean}\n')

# sample median (of salary):
roe_med = np.median(salary)
print(f'roe_med: {roe_med}\n')

# standard deviation (of salary):
roe_s = np.std(salary, ddof=1)
print(f'roe_s: {roe_s}\n')

# correlation with ROE:
roe_corr = np.corrcoef(roe, salary)
print(f'roe_corr: \n{roe_corr}\n')


Output of Script 1.27: Descr-Stats.py
roe_mean: 1281.1196172248804

roe_med: 1039.0

roe_s: 1372.3453079588883

roe_corr:
[[1.         0.11484173]
 [0.11484173 1.        ]]
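
The formulas in Table 1.5 can be checked against the numpy functions on a sample that is small enough to calculate by hand. The following sketch uses made-up numbers and highlights the role of ddof=1: without it, var and std divide by n instead of n - 1.

```python
import numpy as np

# made-up sample:
x = np.array([2.0, 4.0, 4.0, 6.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 9.0])
n = x.size

# sample variance by hand and with numpy:
var_manual = np.sum((x - np.mean(x)) ** 2) / (n - 1)
var_np = np.var(x, ddof=1)
var_ddof0 = np.var(x)  # default ddof=0 divides by n

# cov() returns a matrix, the covariance is the off-diagonal element:
cov_xy = np.cov(x, y)[0, 1]

# correlation by hand and with numpy:
corr_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
corr_np = np.corrcoef(x, y)[0, 1]

print(f'var_manual: {var_manual}\n')
print(f'var_np: {var_np}\n')
print(f'var_ddof0: {var_ddof0}\n')
print(f'corr_manual: {corr_manual}\n')
print(f'corr_np: {corr_np}\n')
```

Here the sample mean is 5 and the squared deviations sum to 28, so the sample variance is 28/4 = 7, while the version with ddof=0 gives 28/5 = 5.6.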


A box plot displays the median (the middle line), the upper and lower quartile (the box) and the 
extreme points graphically. Figure 1.14 shows two examples. 50% of the observations are within the 
interval covered by the box, 25% are above and 25% are below. The extreme points are marked by 
the “whiskers” and outliers are printed as separate dots. In matplotlib, box plots are generated 
using the boxplot command. We have to supply one or more data arrays and can alter the design 
flexibly with numerous options as demonstrated in Script 1.28 (Descr-Boxplot.py).

Figure 1.14. Box Plots

Script 1.28: Descr-Boxplot.py
import wooldridge as woo
import matplotlib.pyplot as plt

ceosal1 = woo.dataWoo('ceosal1')

# extract roe and consprod:
roe = ceosal1['roe']
consprod = ceosal1['consprod']

# plotting descriptive statistics:
plt.boxplot(roe, vert=False)
plt.ylabel('roe')
plt.savefig('PyGraphs/Boxplot1.pdf')
plt.close()

# plotting descriptive statistics by subgroup:
roe_cp0 = roe[consprod == 0]
roe_cp1 = roe[consprod == 1]

plt.boxplot([roe_cp0, roe_cp1])
plt.ylabel('roe')
plt.savefig('PyGraphs/Boxplot2.pdf')
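
The ingredients of a box plot can also be calculated directly with numpy. The sketch below uses made-up data. Which points count as outliers depends on a convention: we assume here the common rule that points more than 1.5 times the interquartile range away from the box are outliers, which we take to be the default setting of boxplot (argument whis=1.5).

```python
import numpy as np

# made-up sample with one unusually large value:
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# lower quartile, median and upper quartile (the box):
q1, med, q3 = np.quantile(x, [0.25, 0.5, 0.75])
iqr = q3 - q1

# assumed rule: points beyond 1.5 * IQR from the box are outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]

print(f'q1: {q1}, med: {med}, q3: {q3}\n')
print(f'outliers: {outliers}\n')
```

In this example, the box spans the interval from 3 to 7 and only the value 30 lies beyond the limits, so it would be drawn as a separate dot.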


Figure 1.14(a) shows how to get a horizontally aligned plot and Figure 1.14(b) demonstrates how
to produce multiple boxplots for two subgroups. The variable consprod from the data set ceosal1
is equal to 1 if the firm is in the consumer product business and 0 otherwise. Apparently, the ROE
is much higher in this industry.

Table 1.6. scipy Functions for Statistical Distributions

Distribution   Param.    PMF/PDF                   CDF                       Quantile
Discrete distributions:
Bernoulli      p         bernoulli.pmf(x,p)        bernoulli.cdf(x,p)        bernoulli.ppf(q,p)
Binomial       n, p      binom.pmf(x,n,p)          binom.cdf(x,n,p)          binom.ppf(q,n,p)
Hypergeom.     M, n, N   hypergeom.pmf(x,M,n,N)    hypergeom.cdf(x,M,n,N)    hypergeom.ppf(q,M,n,N)
Poisson        λ         poisson.pmf(x,λ)          poisson.cdf(x,λ)          poisson.ppf(q,λ)
Geometric      p         geom.pmf(x,p)             geom.cdf(x,p)             geom.ppf(q,p)
Continuous distributions:
Uniform        a, b      uniform.pdf(x,a,b-a)      uniform.cdf(x,a,b-a)      uniform.ppf(q,a,b-a)
Logistic       (none)    logistic.pdf(x)           logistic.cdf(x)           logistic.ppf(q)
Exponential    λ         expon.pdf(x,scale=1/λ)    expon.cdf(x,scale=1/λ)    expon.ppf(q,scale=1/λ)
Std. normal    (none)    norm.pdf(x)               norm.cdf(x)               norm.ppf(q)
Normal         μ, σ      norm.pdf(x,μ,σ)           norm.cdf(x,μ,σ)           norm.ppf(q,μ,σ)
Lognormal      m, s      lognorm.pdf(x,s,0,m)      lognorm.cdf(x,s,0,m)      lognorm.ppf(q,s,0,m)
χ²             n         chi2.pdf(x,n)             chi2.cdf(x,n)             chi2.ppf(q,n)
t              n         t.pdf(x,n)                t.cdf(x,n)                t.ppf(q,n)
F              m, n      f.pdf(x,m,n)              f.cdf(x,m,n)              f.ppf(q,m,n)


1.6. Probability Distributions 


Appendix B of Wooldridge (2019) introduces the concepts of random variables and their probability
distributions.15 The module scipy has many functions for conveniently working with a large
number of statistical distributions.16 The commands for evaluating the probability density function
(PDF) for continuous, the probability mass function (PMF) for discrete, and the cumulative distribution
function (CDF) as well as the quantile function (inverse CDF) for the most relevant distributions
are shown in Table 1.6. The functions are available after executing

import scipy.stats as stats

The module documentation defines the relation of a distribution's set of parameters and the function
arguments in scipy. We will now briefly discuss each function type.
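
How distribution parameters map to function arguments is a frequent source of errors, so it is worth checking a few cases numerically. The sketch below relies on the loc and scale arguments of scipy.stats: for the normal distribution, scale is the standard deviation. For the exponential distribution, it is 1/λ. For a uniform distribution on the interval [a, b], loc is a and scale is the width b - a. The numbers are made up for illustration.

```python
import numpy as np
import scipy.stats as stats

# normal: scale is the standard deviation, not the variance
var_norm = stats.norm.var(loc=4, scale=3)

# exponential with rate lam: scale = 1 / lam
lam = 2.0
mean_expon = stats.expon.mean(scale=1 / lam)
pdf_expon_0 = stats.expon.pdf(0, scale=1 / lam)

# uniform on [a, b]: loc = a, scale = b - a
a, b = 2.0, 5.0
cdf_unif = stats.uniform.cdf([a, (a + b) / 2, b], a, b - a)

print(f'var_norm: {var_norm}\n')
print(f'mean_expon: {mean_expon}\n')
print(f'pdf_expon_0: {pdf_expon_0}\n')
print(f'cdf_unif: {cdf_unif}\n')
```

The variance of the normal distribution with scale=3 is 9, the exponential distribution with λ = 2 has mean 1/2 and density 2 at zero, and the uniform CDF runs from 0 at a to 1 at b.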


1.6.1. Discrete Distributions 


Discrete random variables can only take a finite (or "countably infinite") set of values. The PMF
f(x) = P(X = x) gives the probability that a random variable X with this distribution takes the
given value x. For the most important of those distributions (Bernoulli, Binomial, Hypergeometric17,
Poisson, and Geometric18), Table 1.6 lists the scipy functions that return the PMF for any value
x given the parameters of the respective distribution. See the module documentation if you are
interested in the formal definitions of the PMFs.

For a specific example, let X denote the number of white balls we get when drawing with replacement
10 balls from an urn that includes 20% white balls. Then X has the Binomial distribution
with the parameters n = 10 and p = 20% = 0.2. We know that the probability to get exactly
x ∈ {0, 1, ..., 10} white balls for this distribution is19

$$f(x) = P(X = x) = \binom{n}{x} \cdot p^x \cdot (1-p)^{n-x} = \binom{10}{x} \cdot 0.2^x \cdot 0.8^{10-x} \qquad (1.1)$$

For example, the probability to get exactly x = 2 white balls is
$f(2) = \binom{10}{2} \cdot 0.2^2 \cdot 0.8^8 = 0.302$. Of
course, we can let Python do these calculations using basic Python commands we know from Section
1.1. More conveniently, we can also use the function binom.pmf for the Binomial distribution:

15 The stripped-down textbook for Europe and Africa Wooldridge (2014) does not include this appendix. But the
material is pretty standard.

16 scipy is part of the Anaconda distribution and more information about the module is given in Virtanen, Gommers,
Oliphant, Haberland, Reddy, Cournapeau, Burovski, Peterson, Weckesser, Bright, van der Walt, Brett, Wilson, Jarrod
Millman, Mayorov, Nelson, Jones, Kern, Larson, Carey, Polat, Feng, Moore, VanderPlas, Laxalde, Perktold, Cimrman,
Henriksen, Quintero, Harris, Archibald, Ribeiro, Pedregosa, van Mulbregt, and Contributors (2020).

17 The parameters of the distribution are defined as follows: M is the total number of balls in an urn, n is the total
number of marked balls in this urn, N is the number of drawn balls and x is the number of drawn marked balls.

18 x is the total number of trials, i.e. the number of failures in a sequence of Bernoulli trials before a success occurs
plus the success trial.


Script 1.29: PMF-binom.py
import scipy.stats as stats
import math

# pedestrian approach:
c = math.factorial(10) / (math.factorial(2) * math.factorial(10 - 2))
p1 = c * (0.2 ** 2) * (0.8 ** 8)
print(f'p1: {p1}\n')

# scipy function:
p2 = stats.binom.pmf(2, 10, 0.2)
print(f'p2: {p2}\n')


Output of Script 1.29: PMF-binom.py
p1: 0.3019898880000002

p2: 0.30198988799999993

We can also give arrays as one or more arguments to stats.binom.pmf(x, n, p) and receive the
results as an array. Script 1.30 (PMF-example.py) evaluates the PMF for our example at all possible
values for x (0 through 10). It displays a table of the probabilities and creates a bar chart of these
probabilities which is shown in Figure 1.15(a). As always: feel encouraged to experiment!


Script 1.30: PMF-example.py
import scipy.stats as stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# values for x (all between 0 and 10):
x = np.linspace(0, 10, num=11)

# PMF for all these values:
fx = stats.binom.pmf(x, 10, 0.2)

# collect values in DataFrame:
result = pd.DataFrame({'x': x, 'fx': fx})
print(f'result: \n{result}\n')

# plot:
plt.bar(x, fx, color='0.6')
plt.xlabel('x')
plt.ylabel('fx')
plt.savefig('PyGraphs/PMF-example.pdf')


19 See Wooldridge (2019, Equation (B.14)).

Figure 1.15. Plots of the PMF and PDF

Output of Script 1.30: PMF-example.py
result:
       x            fx
0    0.0  1.073742e-01
1    1.0  2.684355e-01
2    2.0  3.019899e-01
3    3.0  2.013266e-01
4    4.0  8.808038e-02
5    5.0  2.642412e-02
6    6.0  5.505024e-03
7    7.0  7.864320e-04
8    8.0  7.372800e-05
9    9.0  4.096000e-06
10  10.0  1.024000e-07


1.6.2. Continuous Distributions


For continuous distributions like the uniform, logistic, exponential, normal, t, χ², or F distribution,
the probability density functions f(x) are also implemented for direct use in scipy. These can for
example be used to plot the density functions using the plot command (see Section 1.4). Figure
1.15(b) shows the famous bell-shaped PDF of the standard normal distribution and is created by
Script 1.31 (PDF-example.py).


Script 1.31: PDF-example.py
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

# support of normal density:
x_range = np.linspace(-4, 4, num=100)

# PDF for all these values:
pdf = stats.norm.pdf(x_range)

# plot:
plt.plot(x_range, pdf, linestyle='-', color='black')
plt.xlabel('x')
plt.ylabel('dx')
plt.savefig('PyGraphs/PDF-example.pdf')


1.6.3. Cumulative Distribution Function (CDF) 


For all distributions, the CDF $F(x) = P(X \le x)$ represents the probability that the random variable
X takes a value of at most x. The probability that X is between two values a and b is
$P(a < X \le b) = F(b) - F(a)$. We can directly use the scipy functions in the CDF column of Table 1.6
to do these calculations as demonstrated in Script 1.32 (CDF-example.py). In our example presented above,
the probability that we get 3 or fewer white balls is F(3) using the appropriate CDF of the Binomial
distribution. It amounts to 87.9%. The probability that a standard normal random variable takes a
value between -1.96 and 1.96 is 95%.
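
For a discrete distribution, the relation between the CDF and the PMF can be checked by simply adding up probabilities. The following sketch does this for the urn example: F(3) is the sum of the PMF over 0, 1, 2, 3, and the probability of the event 1 < X ≤ 3 is F(3) - F(1), which only includes the outcomes 2 and 3.

```python
import numpy as np
import scipy.stats as stats

# PMF of the urn example for all possible values:
x = np.arange(0, 11)
pmf = stats.binom.pmf(x, 10, 0.2)

# F(3) as a sum of PMF values and directly from the CDF:
F3_sum = np.sum(pmf[x <= 3])
F3_cdf = stats.binom.cdf(3, 10, 0.2)

# P(1 < X <= 3) as a difference of CDF values and as a sum:
p_diff = stats.binom.cdf(3, 10, 0.2) - stats.binom.cdf(1, 10, 0.2)
p_sum = pmf[2] + pmf[3]

print(f'F3_sum: {F3_sum}\n')
print(f'F3_cdf: {F3_cdf}\n')
print(f'p_diff: {p_diff}\n')
print(f'p_sum: {p_sum}\n')
```

Note that the lower bound is excluded: P(1 ≤ X ≤ 3) would be F(3) - F(0) instead. For continuous distributions, this distinction does not matter because single points have probability zero.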


Script 1.32: CDF-example.py
import scipy.stats as stats

# binomial CDF:
p1 = stats.binom.cdf(3, 10, 0.2)
print(f'p1: {p1}\n')

# normal CDF:
p2 = stats.norm.cdf(1.96) - stats.norm.cdf(-1.96)
print(f'p2: {p2}\n')


Output of Script 1.32: CDF-example.py
p1: 0.8791261184000001

p2: 0.950004209703559


Wooldridge, Example B.6: Probabilities for a Normal Random Variable

We assume X ~ Normal(4, 9) and want to calculate $P(2 < X \le 6)$ as our first example. We can rewrite
the problem so it is stated in terms of a standard normal distribution as shown by Wooldridge (2019):
$P(2 < X \le 6) = \Phi(\frac{2}{3}) - \Phi(-\frac{2}{3})$. We can also spare ourselves the transformation
and work with the non-standard normal distribution directly. Be careful that the third argument in the
scipy commands for the normal distribution is not the variance $\sigma^2 = 9$ but the standard deviation
$\sigma = 3$. The second example calculates
$P(|X| > 2) = \underbrace{1 - P(X \le 2)}_{P(X > 2)} + P(X < -2)$.

Note that we get a slightly different answer in the first example than the one given in Wooldridge (2019)
since we're working with the exact $\frac{2}{3}$ instead of the rounded .67.


Script 1.33: Example-B-6.py
import scipy.stats as stats

# first example using the transformation:
p1_1 = stats.norm.cdf(2 / 3) - stats.norm.cdf(-2 / 3)
print(f'p1_1: {p1_1}\n')

# first example working directly with the distribution of X:
p1_2 = stats.norm.cdf(6, 4, 3) - stats.norm.cdf(2, 4, 3)
print(f'p1_2: {p1_2}\n')

# second example:
p2 = 1 - stats.norm.cdf(2, 4, 3) + stats.norm.cdf(-2, 4, 3)
print(f'p2: {p2}\n')


Output of Script 1.33: Example-B-6.py
p1_1: 0.4950149249061542

p1_2: 0.4950149249061542

p2: 0.7702575944012563


Figure 1.16. Plots of the CDF of Discrete and Continuous RV
(a) Binomial CDF    (b) Standard normal CDF


The graph of the CDF is a step function for discrete distributions. For the urn example, the CDF is 
shown in Figure 1.16(a). The CDF of a continuous distribution is illustrated by the S-shaped CDF of 
the normal distribution as shown in Figure 1.16(b). Both figures are created by the following code: 


Script 1.34: CDF-figure.py
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

# binomial:
# support of binomial PMF:
x_binom = np.linspace(-1, 10, num=1000)

# CDF for all these values:
cdf_binom = stats.binom.cdf(x_binom, 10, 0.2)

# plot:
plt.step(x_binom, cdf_binom, linestyle='-', color='black')
plt.xlabel('x')
plt.ylabel('Fx')
plt.savefig('PyGraphs/CDF-figure-discrete.pdf')
plt.close()

# normal:
# support of normal density:
x_norm = np.linspace(-4, 4, num=1000)

# CDF for all these values:
cdf_norm = stats.norm.cdf(x_norm)

# plot:
plt.plot(x_norm, cdf_norm, linestyle='-', color='black')
plt.xlabel('x')
plt.ylabel('Fx')
plt.savefig('PyGraphs/CDF-figure-cont.pdf')


Quantile function


The q-quantile x[q] of a random variable is the value for which the probability to sample a value
x ≤ x[q] is just q. These values are important for example for calculating critical values of test
statistics.

To give a simple example: Given X is standard normal, the 0.975-quantile is x[0.975] ≈ 1.96. So
the probability to sample a value less than or equal to 1.96 is 97.5%:


Script 1.35: Quantile-example.py
import scipy.stats as stats

q_975 = stats.norm.ppf(0.975)
print(f'q_975: {q_975}\n')


Output of Script 1.35: Quantile-example.py
q_975: 1.959963984540054
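
Since the quantile function is the inverse of the CDF, plugging a quantile back into the CDF should return the probability we started with. The following sketch checks this for the standard normal distribution. It also looks at the urn example, where this does not work exactly: for discrete distributions, ppf returns the smallest value x for which F(x) is at least q.

```python
import scipy.stats as stats

# continuous case: the CDF undoes the quantile function
q = 0.975
x_q = stats.norm.ppf(q)
back = stats.norm.cdf(x_q)

# discrete case (urn example): smallest x with F(x) >= 0.5
x_binom = stats.binom.ppf(0.5, 10, 0.2)
F_below = stats.binom.cdf(x_binom - 1, 10, 0.2)
F_at = stats.binom.cdf(x_binom, 10, 0.2)

print(f'x_q: {x_q}\n')
print(f'back: {back}\n')
print(f'x_binom: {x_binom}\n')
print(f'F_below: {F_below}, F_at: {F_at}\n')
```

In the urn example, the 0.5-quantile is 2: the CDF is still below 0.5 at x = 1 and exceeds it at x = 2.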


1.6.4. Random Draws from Probability Distributions 


It is easy to simulate random outcomes by taking a sample from a random variable with a given 
distribution. Strictly speaking, a deterministic machine like a computer can never produce any truly 
random results and we should instead refer to the generated numbers as pseudo-random numbers. 
But for our purpose, it is enough that the generated samples look, feel and behave like true random 
numbers and so we are a little sloppy in our terminology here. For a review of sampling and related 
concepts see Wooldridge (2019, Appendix C.1). 

Before we make heavy use of generating random samples in Section 1.9, we introduce the mechanics
here. Commands in scipy to generate a (pseudo-) random sample are constructed by combining
the command of the respective distribution (see Table 1.6) and the function name rvs. We could for
example simulate the result of flipping a fair coin 10 times. We draw a sample of size n = 10 from a
Bernoulli distribution with parameter p = 1/2. Each of the 10 generated numbers will take the value 1
with probability p = 1/2 and 0 with probability 1 − p = 1/2. The result behaves the same way as though
we had actually flipped a coin and translated heads as 1 and tails as 0 (or vice versa). Here is the
code and a sample generated by it:


Script 1.36: smpl-bernoulli.py
import scipy.stats as stats

sample = stats.bernoulli.rvs(0.5, size=10)
print(f'sample: {sample}\n')

Output of Script 1.36: smpl-bernoulli.py
sample: [1 0 0 1 0 1 1 0 0 1]


Translated into the coins, our sample is heads-tails-tails-heads-tails-heads-heads-tails-tails-heads. An 
obvious advantage of doing this in Python rather than with an actual coin is that we can painlessly 
increase the sample size to 1,000 or 10,000,000. Taking draws from the standard normal distribution 
is equally simple: 

Script 1.37: smpl-norm.py
import scipy.stats as stats

sample = stats.norm.rvs(size=10)
print(f'sample: {sample}\n')

Output of Script 1.37: smpl-norm.py
sample: [ 2.1652536   0.63260132  0.20412996 -1.94355999 -0.95095503 -0.2650094
  0.46289967 -1.05426978  0.54156159 -0.95774292]


Working with computer-generated random samples creates problems for the reproducibility of
the results. If you run the code above, you will get different samples. If we rerun the code, the
sample will change again. We can solve this problem by making use of how the random numbers
are actually generated, which, as already noted, does not involve true randomness. In fact, we will
always get the same sequence of numbers if we reset the random number generator to some specific
state ("seed"). In Python, this can be done with numpy's function random.seed(number), where
number is some arbitrary integer that defines the state but has no other meaning. If we set the seed
to some arbitrary integer, take a sample, reset the seed to the same state and take another sample,
both samples will be the same. Also, if I draw a sample with that seed it will be equal to the sample
you draw if we both start from the same seed.

Script 1.38 (Random-Numbers.py) demonstrates the workings of random.seed.

Script 1.38: Random-Numbers.py
import numpy as np
import scipy.stats as stats

# sample from a standard normal RV with sample size n=5:
sample1 = stats.norm.rvs(size=5)
print(f'sample1: {sample1}\n')

# a different sample from the same distribution:
sample2 = stats.norm.rvs(size=5)
print(f'sample2: {sample2}\n')

# set the seed of the random number generator and take two samples:
np.random.seed(6254137)
sample3 = stats.norm.rvs(size=5)
print(f'sample3: {sample3}\n')

sample4 = stats.norm.rvs(size=5)
print(f'sample4: {sample4}\n')

# reset the seed to the same value to get the same samples again:
np.random.seed(6254137)
sample5 = stats.norm.rvs(size=5)
print(f'sample5: {sample5}\n')

sample6 = stats.norm.rvs(size=5)
print(f'sample6: {sample6}\n')

Output of Script 1.38: Random-Numbers.py
sample1: [ 0.56038146 -1.41869121  1.74692595  0.4244097   0.67217059]

sample2: [-1.21348357  2.08717118 -0.4821461  -3.22837683  0.44109069]

sample3: [ 1.18545933 -0.261977    0.30894761 -2.23354318  0.17612456]

sample4: [-0.17500741 -1.30835159  0.5036692   0.14991385  0.99957472]

sample5: [ 1.18545933 -0.261977    0.30894761 -2.23354318  0.17612456]

sample6: [-0.17500741 -1.30835159  0.5036692   0.14991385  0.99957472]
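
As an aside that is not used in the scripts of this book: instead of resetting the global seed, a random number generator object can be created and handed to rvs. The following is a minimal sketch, assuming a scipy version whose rvs method accepts a numpy Generator through the random_state argument; the names rng_a, rng_b, sample_a and sample_b are our own.

```python
import numpy as np
import scipy.stats as stats

# two generator objects initialized with the same seed:
rng_a = np.random.default_rng(6254137)
rng_b = np.random.default_rng(6254137)

# each call uses its own generator instead of the global state:
sample_a = stats.norm.rvs(size=5, random_state=rng_a)
sample_b = stats.norm.rvs(size=5, random_state=rng_b)

# both samples are identical because both generators start in the same state:
same = np.array_equal(sample_a, sample_b)
print(f'same: {same}\n')
```

The numbers drawn this way differ from those produced after np.random.seed with the same integer, because a different generator is used. Only the reproducibility logic is the same.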
1.7. Confidence Intervals and Statistical Inference


Wooldridge (2019) provides a concise overview of basic sampling, estimation, and testing. We will
touch on some of these issues below.²⁰


1.7.1. Confidence Intervals 


Confidence intervals (CI) are introduced in Wooldridge (2019, Appendix C.5). They are constructed
to cover the true population parameter of interest with a given high probability, e.g. 95%. More
clearly: For 95% of all samples, the implied CI includes the population parameter.

CI are easy to compute. For a normal population with unknown mean $\mu$ and variance $\sigma^2$, the
$100(1-\alpha)\%$ confidence interval for $\mu$ is given in Wooldridge (2019, Equations C.24 and C.25):

$$\left[ \bar{y} - c_{\alpha/2} \cdot se(\bar{y}), \quad \bar{y} + c_{\alpha/2} \cdot se(\bar{y}) \right] \qquad (1.2)$$

where $\bar{y}$ is the sample average, $se(\bar{y}) = \frac{s}{\sqrt{n}}$ is the standard error of $\bar{y}$ (with $s$ being the sample
standard deviation of $y$), $n$ is the sample size and $c_{\alpha/2}$ the $(1-\frac{\alpha}{2})$ quantile of the $t_{n-1}$ distribution. To
get the 95% CI ($\alpha = 5\%$), we thus need $c_{0.025}$ which is the 0.975 quantile or 97.5th percentile.

We already know how to calculate all these ingredients. The way of calculating the CI is used in
the solution to Example C.2. In Section 1.9.3, we will calculate confidence intervals in a simulation
experiment to help us understand the meaning of confidence intervals.


20 The stripped-down textbook for Europe and Africa, Wooldridge (2014), does not include the discussion of this material.

Wooldridge, Example C.2: Effect of Job Training Grants on Worker Productivity


We are analyzing scrap rates for firms that receive a job training grant in 1988. The scrap rates for 1987 
and 1988 are printed in Wooldridge (2019, Table C.3) and are entered manually in the beginning of 
Script 1.39 (Example-C-2.py). We are interested in the change between the years. The calculation of 
its average as well as the confidence interval are performed precisely as shown above. The resulting CI 
is the same as the one presented in Wooldridge (2019) except for rounding errors we avoid by working 
with the exact numbers. 


Script 1.39: Example-C-2.py
import numpy as np
import scipy.stats as stats

# manually enter raw data from Wooldridge, Table C.3:
SR87 = np.array([10, 1, 6, .45, 1.25, 1.3, 1.06, 3, 8.18, 1.67,
                 .98, 1, .45, 5.03, 8, 9, 18, .28, 7, 3.97])
SR88 = np.array([3, 1, 5, .5, 1.54, 1.5, .8, 2, .67, 1.17, .51,
                 .5, .61, 6.7, 4, 7, 19, .2, 5, 3.83])

# calculate change:
Change = SR88 - SR87

# ingredients to CI formula:
avgCh = np.mean(Change)
print(f'avgCh: {avgCh}\n')

n = len(Change)
sdCh = np.std(Change, ddof=1)
se = sdCh / np.sqrt(n)
print(f'se: {se}\n')

c = stats.t.ppf(0.975, n - 1)
print(f'c: {c}\n')

# confidence interval:
lowerCI = avgCh - c * se
print(f'lowerCI: {lowerCI}\n')

upperCI = avgCh + c * se
print(f'upperCI: {upperCI}\n')

Output of Script 1.39: Example-C-2.py
avgCh: -1.1544999999999999

se: 0.5367992249386514

c: 2.093024054408263

lowerCI: -2.2780336901843095

upperCI: -0.030966309815690485
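
The manual calculation can be cross-checked against the interval method that scipy provides for its distributions. This is a sketch rather than part of the original example: the names lower_chk and upper_chk are our own, and the confidence level is passed by position because the name of that argument has changed across scipy versions.

```python
import numpy as np
import scipy.stats as stats

# scrap rate data as entered in Script 1.39:
SR87 = np.array([10, 1, 6, .45, 1.25, 1.3, 1.06, 3, 8.18, 1.67,
                 .98, 1, .45, 5.03, 8, 9, 18, .28, 7, 3.97])
SR88 = np.array([3, 1, 5, .5, 1.54, 1.5, .8, 2, .67, 1.17, .51,
                 .5, .61, 6.7, 4, 7, 19, .2, 5, 3.83])
Change = SR88 - SR87

# ingredients as before:
n = len(Change)
avgCh = np.mean(Change)
se = np.std(Change, ddof=1) / np.sqrt(n)

# 95% CI from the t distribution with n-1 d.f., centered at avgCh with scale se:
lower_chk, upper_chk = stats.t.interval(0.95, n - 1, loc=avgCh, scale=se)
print(f'lower_chk: {lower_chk}\n')
print(f'upper_chk: {upper_chk}\n')
```

Both bounds should agree with lowerCI and upperCI from the manual calculation up to floating point precision.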
Wooldridge, Example C.3: Race Discrimination in Hiring


We are looking into race discrimination using the data set AUDIT. The variable y represents the
difference in hiring rates between black and white applicants with identical CVs. After calculating the
average, sample size, standard deviation and the standard error of the sample average, Script 1.40
(Example-C-3.py) calculates the value for the factor c as the 97.5th percentile of the standard normal
distribution which is (very close to) 1.96. Finally, the 95% and 99% CI are reported.²¹


Script 1.40: Example-C-3.py
import wooldridge as woo
import numpy as np
import scipy.stats as stats

audit = woo.dataWoo('audit')
y = audit['y']

# ingredients to CI formula:
avgy = np.mean(y)
n = len(y)
sdy = np.std(y, ddof=1)
se = sdy / np.sqrt(n)
c95 = stats.norm.ppf(0.975)
c99 = stats.norm.ppf(0.995)

# 95% confidence interval:
lowerCI95 = avgy - c95 * se
print(f'lowerCI95: {lowerCI95}\n')

upperCI95 = avgy + c95 * se
print(f'upperCI95: {upperCI95}\n')

# 99% confidence interval:
lowerCI99 = avgy - c99 * se
print(f'lowerCI99: {lowerCI99}\n')

upperCI99 = avgy + c99 * se
print(f'upperCI99: {upperCI99}\n')

Output of Script 1.40: Example-C-3.py
lowerCI95: -0.19363006093502752

upperCI95: -0.07193010504007621

lowerCI99: -0.21275050976771243

upperCI99: -0.052809656207391295


21 Note that Wooldridge (2019) has a typo in the discussion of this example; therefore, the numbers don't quite match for the
95% CI.

1.7.2. t Tests


Hypothesis tests are covered in Wooldridge (2019, Appendix C.6). The t test statistic for testing a
hypothesis about the mean $\mu$ of a normally distributed random variable $Y$ is shown in Equation
C.35. Given the null hypothesis $H_0: \mu = \mu_0$,

$$t = \frac{\bar{y} - \mu_0}{se(\bar{y})}. \qquad (1.3)$$

We already know how to calculate the ingredients from Section 1.7.1 and show how to use them to
perform a t test in Script 1.42 (Example-C-5.py). We also compare the result to the output of the
scipy function ttest_1samp, which performs an automated t test.

The critical value for this test statistic depends on whether the test is one-sided or two-sided.
The value needed for a two-sided test, $c_{\alpha/2}$, was already calculated for the CI; the other values can be
generated accordingly. The values for different degrees of freedom $n-1$ and significance levels $\alpha$ are
listed in Wooldridge (2019, Table G.2). Script 1.41 (Critical-Values-t.py) demonstrates how
we can calculate our own table of critical values for the example of 19 degrees of freedom.


Script 1.41: Critical-Values-t.py
import numpy as np
import pandas as pd
import scipy.stats as stats

# degrees of freedom = n-1:
df = 19

# significance levels:
alpha_one_tailed = np.array([0.1, 0.05, 0.025, 0.01, 0.005, .001])
alpha_two_tailed = alpha_one_tailed * 2

# critical values & table:
CV = stats.t.ppf(1 - alpha_one_tailed, df)
table = pd.DataFrame({'alpha_one_tailed': alpha_one_tailed,
                      'alpha_two_tailed': alpha_two_tailed, 'CV': CV})
print(f'table: \n{table}\n')

Output of Script 1.41: Critical-Values-t.py
table:
   alpha_one_tailed  alpha_two_tailed        CV
0             0.100             0.200  1.327728
1             0.050             0.100  1.729133
2             0.025             0.050  2.093024
3             0.010             0.020  2.539483
4             0.005             0.010  2.860935
5             0.001             0.002  3.579400
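
To see why the same critical value serves a one-tailed test at level α and a two-tailed test at level 2α, we can plug the critical values back into the t distribution. The following is a minimal sketch; the names right_tail and both_tails are our own, and sf is scipy's survival function (1 minus the CDF).

```python
import numpy as np
import scipy.stats as stats

# same setup as in the table of critical values:
df = 19
alpha_one_tailed = np.array([0.1, 0.05, 0.025, 0.01, 0.005, .001])
CV = stats.t.ppf(1 - alpha_one_tailed, df)

# probability of T > CV (right tail only):
right_tail = stats.t.sf(CV, df)
print(f'right_tail: {right_tail}\n')

# probability of |T| > CV (both tails, by symmetry of the t distribution):
both_tails = 2 * right_tail
print(f'both_tails: {both_tails}\n')
```

The first array reproduces the one-tailed significance levels and the second one the two-tailed levels.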
Wooldridge, Example C.5: Race Discrimination in Hiring


We continue Example C.3 in Script 1.42 (Example-C-5.py) and perform a one-sided t test of the null
hypothesis $H_0: \mu = 0$ against $H_1: \mu < 0$ for the same sample. As the output shows, the t test statistic is
approximately −4.28. This is much smaller than the negative of the critical value for any sensible significance
level. Therefore, we reject $H_0: \mu = 0$ for this one-sided test, see Wooldridge (2019, Equation C.38).


Script 1.42: Example-C-5.py
import wooldridge as woo
import numpy as np
import pandas as pd
import scipy.stats as stats

audit = woo.dataWoo('audit')
y = audit['y']

# automated calculation of t statistic for H0 (mu=0):
test_auto = stats.ttest_1samp(y, popmean=0)
t_auto = test_auto.statistic  # access test statistic
p_auto = test_auto.pvalue  # access two-sided p value
print(f't_auto: {t_auto}\n')
print(f'p_auto/2: {p_auto / 2}\n')

# manual calculation of t statistic for H0 (mu=0):
avgy = np.mean(y)
n = len(y)
sdy = np.std(y, ddof=1)
se = sdy / np.sqrt(n)
t_manual = avgy / se
print(f't_manual: {t_manual}\n')

# critical values for t distribution with n-1=240 d.f.:
alpha_one_tailed = np.array([0.1, 0.05, 0.025, 0.01, 0.005, .001])
CV = stats.t.ppf(1 - alpha_one_tailed, 240)
table = pd.DataFrame({'alpha_one_tailed': alpha_one_tailed, 'CV': CV})
print(f'table: \n{table}\n')

Output of Script 1.42: Example-C-5.py
t_auto: -4.276816348963646

p_auto/2: 1.369270781112999e-05

t_manual: -4.276816348963646

table:
   alpha_one_tailed        CV
0             0.100  1.285089
1             0.050  1.651227
2             0.025  1.969898
3             0.010  2.341985
4             0.005  2.596469
5             0.001  3.124536
1.7.3. p Values


The p value for a test is the probability that (under the assumptions needed to derive the distribution
of the test statistic) a different random sample would produce the same or an even more extreme
value of the test statistic.²² The advantage of using p values for statistical testing is that they are
convenient to use. Instead of having to compare the test statistic with critical values which are
implied by the significance level $\alpha$, we directly compare $p$ with $\alpha$. For two-sided t tests, the formula
for the p value is given in Wooldridge (2019, Equation C.42):

$$p = 2 \cdot P(T_{n-1} > |t|) = 2 \cdot \left(1 - F_{t_{n-1}}(|t|)\right), \qquad (1.4)$$

where $F_{t_{n-1}}(\cdot)$ is the CDF of the $t_{n-1}$ distribution which we know how to calculate from Table 1.6.
Similarly, a one-sided test rejects the null hypothesis only if the value of the estimate is "too high"
or "too low" relative to the null hypothesis. The p values for these types of tests are:

$$p = \begin{cases} P(T_{n-1} < t) = F_{t_{n-1}}(t) & \text{for } H_1: \mu < \mu_0 \\ P(T_{n-1} > t) = 1 - F_{t_{n-1}}(t) & \text{for } H_1: \mu > \mu_0 \end{cases} \qquad (1.5)$$

Since we are working on a computer program that knows the CDF of the t distribution, calculating
p values is straightforward as demonstrated in Script 1.43 (Example-C-6.py). Maybe you noticed
that the scipy function ttest_1samp in Script 1.42 (Example-C-5.py) also calculates the p value,
but be aware that, as used there, it reports the p value of a two-sided t test.
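
The three p value formulas translate directly into calls of the CDF of the t distribution. The following is a minimal sketch with made-up inputs: the test statistic t = -2.15 and the 19 degrees of freedom are illustrative values, and the variable names are our own.

```python
import scipy.stats as stats

# illustrative inputs (not taken from a data set):
t = -2.15
df = 19

# two-sided p value: 2 * (1 - F(|t|))
p_two_sided = 2 * (1 - stats.t.cdf(abs(t), df))
print(f'p_two_sided: {p_two_sided}\n')

# one-sided p value for H1: mu < mu0, i.e. F(t)
p_less = stats.t.cdf(t, df)
print(f'p_less: {p_less}\n')

# one-sided p value for H1: mu > mu0, i.e. 1 - F(t)
p_greater = 1 - stats.t.cdf(t, df)
print(f'p_greater: {p_greater}\n')
```

Because the t distribution is symmetric around zero, the two-sided p value is twice the smaller of the two one-sided p values, and the two one-sided p values add up to one.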


22 The p value is often misinterpreted. It is for example not the probability that the null hypothesis is true. For a discussion,
see for example https://www.nature.com/news/scientific-method-statistical-errors-1.14700.

Wooldridge, Example C.6: Effect of Job Training Grants on Worker Productivity


We continue from Example C.2 in Script 1.43 (Example-C-6.py). We test $H_0: \mu = 0$ against $H_1: \mu < 0$.
The t statistic is −2.15. The formula for the p value for this one-sided test is given in Wooldridge (2019,
Equation C.41). As can be seen in the output of Script 1.43 (Example-C-6.py), its value (using exact
values of t) is around 0.022. If you want to use the scipy function ttest_1samp, you have to divide the
p value by 2, because we are dealing with a one-sided test.


Script 1.43: Example-C-6.py
import numpy as np
import scipy.stats as stats

# manually enter raw data from Wooldridge, Table C.3:
SR87 = np.array([10, 1, 6, .45, 1.25, 1.3, 1.06, 3, 8.18, 1.67,
                 .98, 1, .45, 5.03, 8, 9, 18, .28, 7, 3.97])
SR88 = np.array([3, 1, 5, .5, 1.54, 1.5, .8, 2, .67, 1.17, .51,
                 .5, .61, 6.7, 4, 7, 19, .2, 5, 3.83])
Change = SR88 - SR87

# automated calculation of t statistic for H0 (mu=0):
test_auto = stats.ttest_1samp(Change, popmean=0)
t_auto = test_auto.statistic
p_auto = test_auto.pvalue
print(f't_auto: {t_auto}\n')
print(f'p_auto/2: {p_auto / 2}\n')

# manual calculation of t statistic for H0 (mu=0):
avgCh = np.mean(Change)
n = len(Change)
sdCh = np.std(Change, ddof=1)
se = sdCh / np.sqrt(n)
t_manual = avgCh / se
print(f't_manual: {t_manual}\n')

# manual calculation of p value for H0 (mu=0):
p_manual = stats.t.cdf(t_manual, n - 1)
print(f'p_manual: {p_manual}\n')

Output of Script 1.43: Example-C-6.py
t_auto: -2.150711003973493

p_auto/2: 0.02229062646839212

t_manual: -2.150711003973493

p_manual: 0.02229062646839212
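
If a recent scipy release is installed, the one-sided p value can also be requested directly instead of halving the two-sided one. This sketch assumes scipy version 1.6 or newer, where ttest_1samp accepts an alternative argument; the name test_less is our own.

```python
import numpy as np
import scipy.stats as stats

# scrap rate data as entered in Script 1.43:
SR87 = np.array([10, 1, 6, .45, 1.25, 1.3, 1.06, 3, 8.18, 1.67,
                 .98, 1, .45, 5.03, 8, 9, 18, .28, 7, 3.97])
SR88 = np.array([3, 1, 5, .5, 1.54, 1.5, .8, 2, .67, 1.17, .51,
                 .5, .61, 6.7, 4, 7, 19, .2, 5, 3.83])
Change = SR88 - SR87

# one-sided test of H0 (mu=0) against H1 (mu<0):
test_less = stats.ttest_1samp(Change, popmean=0, alternative='less')
t_less = test_less.statistic
p_less = test_less.pvalue
print(f't_less: {t_less}\n')
print(f'p_less: {p_less}\n')
```

Halving the two-sided p value only gives the correct one-sided result when the t statistic has the sign predicted by the alternative hypothesis, which is the case here.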
Wooldridge, Example C.7: Race Discrimination in Hiring


In Example C.5, we found the t statistic for $H_0: \mu = 0$ against $H_1: \mu < 0$ to be $t = -4.276816$. The
corresponding p value is calculated in Script 1.44 (Example-C-7.py). The number 1.369271e-05 is the
scientific notation for $1.369271 \cdot 10^{-5} = 0.00001369271$. So the p value is around 0.0014% which is much
smaller than any reasonable significance level. By construction, we draw the same conclusion as when
we compare the t statistic with the critical value in Example C.5. We reject the null hypothesis that there
is no discrimination.


Script 1.44: Example-C-7.py
import wooldridge as woo
import numpy as np
import pandas as pd
import scipy.stats as stats

audit = woo.dataWoo('audit')
y = audit['y']

# automated calculation of t statistic for H0 (mu=0):
test_auto = stats.ttest_1samp(y, popmean=0)
t_auto = test_auto.statistic
p_auto = test_auto.pvalue
print(f't_auto: {t_auto}\n')
print(f'p_auto/2: {p_auto / 2}\n')

# manual calculation of t statistic for H0 (mu=0):
avgy = np.mean(y)
n = len(y)
sdy = np.std(y, ddof=1)
se = sdy / np.sqrt(n)
t_manual = avgy / se
print(f't_manual: {t_manual}\n')

# manual calculation of p value for H0 (mu=0):
p_manual = stats.t.cdf(t_manual, n - 1)
print(f'p_manual: {p_manual}\n')

Output of Script 1.44: Example-C-7.py
t_auto: -4.276816348963646

p_auto/2: 1.369270781112999e-05

t_manual: -4.276816348963646

p_manual: 1.369270781112999e-05




1.8. Advanced Python 


The material covered in this section is not necessary for most of what we will do in the remainder
of this book, so it can be skipped. However, it is important enough to justify its own section in this
chapter. We will only scratch the surface, though. For more details, you will have to look somewhere
else, for example Downey (2015).


1.8.1. Conditional Execution 


We might want some parts of our code to be executed only under certain conditions. As in most other
programming languages, this can be achieved with an if else statement. Note that in Python,
the parts to be conditionally executed are identified by indenting them with the same amount of
whitespace. Editors like Spyder will assist us with this. This gives the following syntax:

if condition:
    expression1
else:
    expression2

The condition has to be a single logical value (True or False). If it is True, then expression1
is executed, otherwise expression2, which can also be omitted. A simple example would be

if p <= 0.05:
    print("reject H0!")
else:
    print("don't reject H0!")

Depending on the value of the numeric scalar p, the respective test decision is printed.
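
More than two cases can be distinguished by adding elif branches between if and else. The following is a minimal sketch; the p value of 0.03 and the wording of the decisions are made up for illustration.

```python
# hypothetical p value of some test:
p = 0.03

# the branches are checked from top to bottom, only the first match is executed:
if p <= 0.01:
    decision = 'reject H0 at the 1% level'
elif p <= 0.05:
    decision = 'reject H0 at the 5% level'
else:
    decision = "don't reject H0"

print(f'decision: {decision}\n')
```

Since 0.03 is larger than 0.01 but not larger than 0.05, the second branch is the one that is executed.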


1.8.2. Loops 


For repeatedly executing an expression, different kinds of loops are available. In this book, we will
use them for Monte Carlo analyses introduced in Section 1.9. For our purposes, the for loop is well
suited. The correct syntax (including the indenting) is:

for x in sequence:
    [some commands]

The loop variable x will take the value of each element of sequence, one after another. For each of
these elements, [some commands] are executed. Often, sequence will be a list like [1, 2, 3].

A nonsense example which combines for loops with an if statement is given in Script 1.45
(Adv-Loops.py). The reader is encouraged to first form expectations about the output this will
generate and then compare them with the actual results.


Script 1.45: Adv-Loops.py
seq = [1, 2, 3, 4, 5, 6]
for i in seq:
    if i < 4:
        print(i ** 3)
    else:
        print(i ** 2)

Output of Script 1.45: Adv-Loops.py
1
8
27
16
25
36


Instead of iterating over a sequence you can also iterate over an index of a sequence and use the
index to reference other objects. The “pythonian” way of generating such a sequence of indices uses
the function range, which is demonstrated in Script 1.46 (Adv-Loops2.py) by doing the same as
Script 1.45 (Adv-Loops.py).


Script 1.46: Adv-Loops2.py
seq = [1, 2, 3, 4, 5, 6]
for i in range(len(seq)):
    if seq[i] < 4:
        print(seq[i] ** 3)
    else:
        print(seq[i] ** 2)

Output of Script 1.46: Adv-Loops2.py
1
8
27
16
25
36
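
A related idiom, not used in the scripts above, is the built-in function enumerate, which provides the index and the element at the same time. The following is a minimal sketch; the list results is our own addition to collect the values instead of only printing them.

```python
seq = [1, 2, 3, 4, 5, 6]
results = []

# enumerate yields pairs of (index, element):
for i, value in enumerate(seq):
    if value < 4:
        results.append(value ** 3)
    else:
        results.append(value ** 2)
    print(f'index {i}: {results[i]}')
```

Storing results in a list (or an array) rather than printing them is the pattern needed whenever the values are processed further after the loop.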


If you want to execute expressions as long as a given condition is True, Python offers the while 
loop, but we will not present it here. 


1.8.3. Functions 


A function is a block of code that is executed if the function is called. You can provide additional data
to the function in the form of arguments. There are many pre-defined functions, and modules provide
even more functions to expand the capabilities of Python. We're now ready to define our own little
function.

The command def newfunc(arg1, arg2, ...): defines a new function newfunc which accepts
the arguments arg1, arg2, .... The function definition follows in arbitrarily many lines of
indented code. Within the function definition, the command return stuff means that stuff is
to be returned as a result of the function call. For example, we can define the function mysqrt that
expects one argument internally named x. Script 1.47 (Adv-Functions.py) shows how to define
and call the function mysqrt.
Script 1.47: Adv-Functions.py
# define function:
def mysqrt(x):
    if x >= 0:
        result = x ** 0.5
    else:
        result = 'You fool!'
    return result

# call function and save result:
result1 = mysqrt(4)
print(f'result1: {result1}\n')

result2 = mysqrt(-1.5)
print(f'result2: {result2}\n')

Output of Script 1.47: Adv-Functions.py
result1: 2.0

result2: You fool!


Note that you can pass arguments by name, by position, or a combination of both. Passing arguments
by position is used in the examples in Script 1.47 (Adv-Functions.py), because it is clear
that any provided input to the function must be the argument x. In the case of multiple arguments
the order of provided inputs matters: the first piece of input is related to the first argument in
the function definition, the second piece of input is related to the second argument in the function
definition, etc.

As an alternative you could also execute mysqrt(x=4), which is what is meant by providing arguments
by name. In the case of multiple arguments the order of provided named inputs does not matter.
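
The difference only becomes visible with more than one argument. The following is a minimal sketch using a hypothetical function mypower, which is not part of the scripts of this book:

```python
# hypothetical function with two arguments:
def mypower(base, exponent):
    return base ** exponent

# by position: the first input is base, the second is exponent
by_position = mypower(2, 3)
print(f'by_position: {by_position}\n')

# by name: the order of the inputs does not matter
by_name = mypower(exponent=3, base=2)
print(f'by_name: {by_name}\n')

# combination: positional arguments first, then named ones
mixed = mypower(2, exponent=3)
print(f'mixed: {mixed}\n')

# swapping positional inputs changes the result
swapped = mypower(3, 2)
print(f'swapped: {swapped}\n')
```

The first three calls all compute 2 to the power of 3, whereas the last one computes 3 to the power of 2.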


1.8.4. Object Orientation 


You might have wondered where all the data types we have used so far (e.g. lists or numpy arrays) 
come from. In an object oriented language like Python almost everything is an object and you can 
easily define your own objects. You can think of an object as an elegant way of structuring your 
code: objects store a certain type of data and contain functions that can be applied to this data. In 
the context of objects, functions are called methods and data are saved in local variables of an object 
(also called attributes). 

To work with objects that are suited for your purposes you have to define what kind of data they 
can store and what you want to do with them. The blueprint of such an object is called a "class". If 
you make use of this class to store data and work with them, you are dealing with an "instance" or 
"object" of this class. Of course, one class can be used to create multiple instances of this class. To use 
local variables or methods of an object, you follow the familiar syntax objectname.variablename 
or objectname.methodname(arg1, arg2, ...).

Let's discuss an easy example: you want to build a database in Python for your local bike shop.
The first thing you should do is to define a class bike, where you collect properties of a bike. This
could be the price, size, color or anything else that might be important and you define them as local
variables. Let's say the color of a bike must often be changed before it can be sold, so you add a
method changeColor(newColor) to the class definition. The moment the first bike needs to be
stored in the database, you create an instance of this class, say firstNewBike. Within this instance,
all defined properties are set (also called "initializing"). If a bike with the exact same properties
arrives a few hours later and needs to be stored in the database, you create a new instance, so every
object has its own identity. If you want to change the color of the first instance to green you call
firstNewBike.changeColor('green').
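
The bike shop example could be sketched as follows. This is only an illustration: the attribute values (price 500, size 28, color red) and the name secondNewBike are made up, and the special method __init__ is the place where Python sets the local variables when an instance is created.

```python
# blueprint for all bikes in the database:
class bike:
    def __init__(self, price, size, color):
        # local variables (attributes) of the object:
        self.price = price
        self.size = size
        self.color = color

    def changeColor(self, newColor):
        # method that changes data stored in the object:
        self.color = newColor


# two instances with the exact same (made-up) properties:
firstNewBike = bike(price=500, size=28, color='red')
secondNewBike = bike(price=500, size=28, color='red')

# they are still two different objects:
same_object = firstNewBike is secondNewBike
print(f'same_object: {same_object}\n')

# change the color of the first instance only:
firstNewBike.changeColor('green')
print(f'firstNewBike.color: {firstNewBike.color}\n')
print(f'secondNewBike.color: {secondNewBike.color}\n')
```

Changing the color of firstNewBike leaves secondNewBike untouched, which is what it means for every object to have its own identity.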

In this book, there are only very few cases where we cannot rely on predefined classes provided
by Python or a given module. However, a basic understanding of object orientation helps you to
understand how certain commands work. In Script 1.48 (Adv-ObjOr.py), for example, the class
list is used to create an object named a. The author of this class also added a method count
which is only applied to data stored within a. There are also methods like sort, which change
data stored in an object.


Script 1.48: Adv-ObjOr.py
# use the predefined class 'list' to create an object:
a = [2, 6, 3, 6]

# access a local variable (to find out what kind of object we are dealing with):
check = type(a).__name__
print(f'check: {check}\n')

# make use of a method (how many 6 are in a?):
count_six = a.count(6)
print(f'count_six: {count_six}\n')

# use another method (sort data in a):
a.sort()
print(f'a: {a}\n')

Output of Script 1.48: Adv-ObjOr.py
check: list

count_six: 2

a: [2, 3, 6, 6]


We are now ready to define our own class. Script 1.49 (Adv-ObjOr2.py) demonstrates how
to write your version of the dot method in numpy. Local variables are always initiated by the
__init__ method in Python.

Note that the presented approach of nested loops is not the most computationally efficient way to
implement matrix multiplication in Python. But it helps to demonstrate the definition of a class and
gives another example for using for loops.
Script 1.49: Adv-ObjOr2.py
import numpy as np

# multiply these two matrices:
a = np.array([[3, 6, 1], [2, 7, 4]])
b = np.array([[1, 8, 6], [3, 5, 8], [1, 1, 2]])

# the numpy way:
result_np = a.dot(b)
print(f'result_np: \n{result_np}\n')

# or, do it yourself by defining a class:
class myMatrices:
    def __init__(self, A, B):
        self.A = A
        self.B = B

    def mult(self):
        N = self.A.shape[0]  # number of rows in A
        K = self.B.shape[1]  # number of cols in B
        out = np.empty((N, K))  # initialize output
        for i in range(N):
            for j in range(K):
                out[i, j] = sum(self.A[i, :] * self.B[:, j])
        return out

# create an object:
test = myMatrices(a, b)

# access local variables:
print(f'test.A: \n{test.A}\n')
print(f'test.B: \n{test.B}\n')

# use object method:
result_own = test.mult()
print(f'result_own: \n{result_own}\n')

Output of Script 1.49: Adv-ObjOr2.py
result_np:
[[22 55 68]
 [27 55 76]]

test.A:
[[3 6 1]
 [2 7 4]]

test.B:
[[1 8 6]
 [3 5 8]
 [1 1 2]]

result_own:
[[22. 55. 68.]
 [27. 55. 76.]]
You can easily build on other classes by using a concept called inheritance. Let's assume we want
to extend our class myMatrices by a method that calculates the total number of elements in the
matrix product. Subclass myMatNew in Script 1.50 (Adv-ObjOr3.py) inherits the properties and
methods from myMatrices and adds the method getTotalElem, so by using myMatNew you can
do everything you can do with myMatrices and additionally calculate the total number of elements
in the matrix product.


Script 1.50: Adv-ObjOr3.py
import numpy as np

# multiply these two matrices:
a = np.array([[3, 6, 1], [2, 7, 4]])
b = np.array([[1, 8, 6], [3, 5, 8], [1, 1, 2]])

# define your own class:
class myMatrices:
    def __init__(self, A, B):
        self.A = A
        self.B = B

    def mult(self):
        N = self.A.shape[0]  # number of rows in A
        K = self.B.shape[1]  # number of cols in B
        out = np.empty((N, K))  # initialize output
        for i in range(N):
            for j in range(K):
                out[i, j] = sum(self.A[i, :] * self.B[:, j])
        return out

# define a subclass:
class myMatNew(myMatrices):
    def getTotalElem(self):
        N = self.A.shape[0]  # number of rows in A
        K = self.B.shape[1]  # number of cols in B
        return N * K

# create an object of the subclass:
test = myMatNew(a, b)

# use a method of myMatrices:
result_own = test.mult()
print(f'result_own: \n{result_own}\n')

# use a method of myMatNew:
totalElem = test.getTotalElem()
print(f'totalElem: {totalElem}\n')

Output of Script 1.50: Adv-ObjOr3.py
result_own:
[[22. 55. 68.]
 [27. 55. 76.]]

totalElem: 6
Be aware that we only covered the most important concepts of object-oriented programming that
we will encounter in this book.


1.8.5. Outlook 


While this section is called “Advanced Python”, we have admittedly only scratched the surface of
semi-advanced topics. One topic we defer to Chapter 19 is how Python can automatically create
formatted reports and publication-ready documents.

Another advanced topic is the optimization of computational speed. An example of a seriously
advanced topic for the real Python geek is the use of parallel computing to speed up computations.

Since real Python geeks are not the target audience of this book, we will stop even mentioning more
intimidating possibilities and focus on implementing the most important econometric methods in the
most straightforward and pragmatic way.


1.9. Monte Carlo Simulation 


Appendix C.2 of Wooldridge (2019) contains a brief introduction to estimators and their properties.²³
In real-world applications, we typically have a data set corresponding to a random sample from a 
well-defined population. We don’t know the population parameters and use the sample to estimate 
them. 

When we generate a sample using a computer program as we have introduced in Section 1.6.4, we 
know the population parameters since we had to choose them when making the random draws. We 
could apply the same estimators to this artificial sample to estimate the population parameters. The 
tasks would be: (1) Select a population distribution and its parameters. (2) Generate a sample from 
this distribution. (3) Use the sample to estimate the population parameters. 

If this sounds a little insane to you: Don’t worry, that would be a healthy first reaction. We obtain 
a noisy estimate of something we know precisely. But this sort of analysis does in fact make sense. 
Because we estimate something we actually know, we are able to study the behavior of our estimator 
very well. 

In this book, we mainly use this approach for illustrative and didactic reasons. In state-of-the-art 
research, it is widely used since it often provides the only way to learn about important features of 
estimators and statistical tests. A name frequently given to these sorts of analyses is Monte Carlo 
simulation in reference to the “gambling” involved in generating random samples. 


1.9.1. Finite Sample Properties of Estimators 


Let's look at a simple example and simulate a situation in which we want to estimate the mean $\mu$ of
a normally distributed random variable

$$Y \sim \text{Normal}(\mu, \sigma^2) \qquad (1.6)$$

using a sample of a given size $n$. The obvious estimator for the population mean would be the
sample average $\bar{Y}$. But what properties does this estimator have? The informed reader immediately
knows that the sampling distribution of $\bar{Y}$ is

$$\bar{Y} \sim \text{Normal}\left(\mu, \frac{\sigma^2}{n}\right) \qquad (1.7)$$

Simulation provides a way to verify this claim.


23 The stripped-down textbook for Europe and Africa, Wooldridge (2014), does not include this either.

Script 1.51 (Simulate-Estimate.py) shows a simulation experiment in action: We set the seed
to ensure reproducibility and draw a sample of size $n = 100$ from the population distribution (with
the population parameters $\mu = 10$ and $\sigma = 2$).²⁴ Then, we calculate the sample average as an estimate
of $\mu$. We see results for three different samples.

Script 1.51: Simulate-Estimate.py
import numpy as np
import scipy.stats as stats

# set the random seed:
np.random.seed(123456)

# set sample size:
n = 100

# draw a sample given the population parameters:
sample1 = stats.norm.rvs(10, 2, size=n)

# estimate the population mean with the sample average:
estimate1 = np.mean(sample1)
print(f'estimate1: {estimate1}\n')

# draw a different sample and estimate again:
sample2 = stats.norm.rvs(10, 2, size=n)
estimate2 = np.mean(sample2)
print(f'estimate2: {estimate2}\n')

# draw a third sample and estimate again:
sample3 = stats.norm.rvs(10, 2, size=n)
estimate3 = np.mean(sample3)
print(f'estimate3: {estimate3}\n')


Output of Script 1.51: Simulate-Estimate.py
estimate1: 9.573602656614304

estimate2: 10.24798129790092

estimate3: 9.96021755398913


All sample means $\bar{Y}$ are around the true mean $\mu = 10$, which is consistent with our presumption
formulated in Equation 1.7. It is also not surprising that we don't get the exact population parameter:
that's the nature of the sampling noise. According to Equation 1.7, the results are expected to have
a variance of $\sigma^2/n = 0.04$. Three samples of this kind are insufficient to draw strong conclusions
regarding the validity of Equation 1.7. Good Monte Carlo simulation studies should use as many
samples as possible.

In Section 1.8.2, we introduced for loops. While they are not the most powerful technique available
in Python to implement a Monte Carlo study, we will stick to them since they are quite transparent
and straightforward. The code shown in Script 1.52 (Simulation-Repeated.py) uses a
for loop to draw 10000 samples of size $n = 100$ and calculates the sample average for all of them.
After setting the random seed, the empty array ybar of size 10000 is initialized using the np.empty
command. We will replace these empty array values with the estimates one after another in the loop.
In each of these replications $j = 0, 1, 2, \ldots, 9999$, a sample is drawn, its average calculated and stored
in position number j of ybar. In this way, we end up with an array of 10000 estimates from different
samples. The Script Simulation-Repeated.py does not generate any output.


²⁴ See Section 1.6.4 for the basics of random number generation.


Script 1.52: Simulation-Repeated.py
import numpy as np
import scipy.stats as stats

# set the random seed:
np.random.seed(123456)

# set sample size:
n = 100

# initialize ybar to an array of length r=10000 to later store results:
r = 10000
ybar = np.empty(r)

# repeat r times:
for j in range(r):
    # draw a sample and store the sample mean in pos. j=0,1,... of ybar:
    sample = stats.norm.rvs(10, 2, size=n)
    ybar[j] = np.mean(sample)
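
As remarked above, for loops are not the most powerful technique for a Monte Carlo study. The following is a minimal sketch of a vectorized alternative, not one of the book's scripts: it assumes NumPy's Generator interface (np.random.default_rng) and draws all samples at once as the rows of a matrix. Because a different random number generator is used, the individual draws differ from those of Script 1.52.

```python
import numpy as np

# hypothetical vectorized variant of the loop in Script 1.52
# (assumption: NumPy's Generator API instead of np.random.seed)
rng = np.random.default_rng(123456)

n = 100     # sample size
r = 10000   # number of replications

# draw all r samples at once: one row per replication
samples = rng.normal(loc=10, scale=2, size=(r, n))

# the mean of each row is the estimate from one simulated sample
ybar = samples.mean(axis=1)

print(f'np.mean(ybar): {np.mean(ybar)}')
print(f'np.var(ybar, ddof=1): {np.var(ybar, ddof=1)}')
```

The price of this approach is memory: the whole r x n matrix is held at once, which is unproblematic here but can matter for very large simulations.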


Script 1.53 (Simulation-Repeated-Results.py) analyzes these 10000 estimates. Here, we
just discuss the output, but you can find the complete code in the appendix. The average of ybar is very
close to the presumption $\mu = 10$ from Equation 1.7. The simulated sampling variance is also close
to the theoretical result $\sigma^2/n = 0.04$. Note that the degrees of freedom are adjusted with ddof=1 in
np.var() to compute the unbiased estimate of the variance. Finally, the estimated density (using
a kernel density estimate from the module statsmodels) is compared to the theoretical normal
distribution. The result is shown in Figure 1.17. The two lines are almost indistinguishable except
for the area close to the mode (where the kernel density estimator is known to have problems).
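
Since the complete Script 1.53 is only printed in the appendix, here is a rough sketch of the kind of analysis described above. It is not the book's code: as a stand-in for the statsmodels kernel density estimate we assume scipy's gaussian_kde, and the evaluation grid x_grid is our own choice.

```python
import numpy as np
import scipy.stats as stats

# repeat the simulation of Script 1.52:
np.random.seed(123456)
n, r = 100, 10000
ybar = np.empty(r)
for j in range(r):
    ybar[j] = np.mean(stats.norm.rvs(10, 2, size=n))

# simulated mean and variance of the estimates:
mean_sim = np.mean(ybar)
var_sim = np.var(ybar, ddof=1)
print(f'mean_sim: {mean_sim}')
print(f'var_sim: {var_sim}')

# kernel density estimate (assumption: scipy's gaussian_kde as a stand-in)
# versus the theoretical density Normal(10, 4/n) on a hypothetical grid:
x_grid = np.linspace(9, 11, 201)
density_sim = stats.gaussian_kde(ybar)(x_grid)
density_theory = stats.norm.pdf(x_grid, loc=10, scale=np.sqrt(4 / n))

# largest vertical distance between the two curves:
max_gap = np.max(np.abs(density_sim - density_theory))
print(f'max_gap: {max_gap}')
```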


Output of Script 1.53: Simulation-Repeated-Results.py
ybar[0:19]:
[ 9.57360266 10.2479813   9.96021755  9.67635967  9.82261605  9.6270579
 10.02979223 10.15400282 10.28812728  9.69935763 10.41950951 10.07993562
  9.75764232 10.10504699  9.99813607  9.92113688  9.55713599 10.01404669
 10.25550724]

np.mean(ybar): 10.00082418067469

np.var(ybar, ddof=1): 0.03989666893894718


To summarize, the simulation results confirm the theoretical results in Equation 1.7. Mean, vari- 
ance and density are very close and it seems likely that the remaining tiny differences are due to the 
fact that we "only" used 10000 samples. 

Remember: for most advanced estimators, such simulations are the only way to study some of 
their features since it is impossible to derive theoretical results of interest. For us, the simple exam- 
ple hopefully clarified the approach of Monte Carlo simulations and the meaning of the sampling 
distribution and prepared us for other interesting simulation exercises. 
Figure 1.17. Simulated and Theoretical Density of $\bar{Y}$


1.9.2. Asymptotic Properties of Estimators


Asymptotic analyses are concerned with large samples and with the behavior of estimators and
other statistics as the sample size $n$ increases without bound. For a discussion of these topics, see
Wooldridge (2019, Appendix C.3). According to the law of large numbers, the sample average $\bar{Y}$ in
the above example converges in probability to the population mean $\mu$ as $n \to \infty$. In (infinitely) large
samples, this implies that $E(\bar{Y}) \to \mu$ and $\text{Var}(\bar{Y}) \to 0$.

With Monte Carlo simulation, we have a tool to see how this works out in our example.
We just have to change the sample size in the code line n = 100 in Script 1.52
(Simulation-Repeated.py) to a different number and rerun the simulation code. Results
for $n = 10, 50, 100$, and $1000$ are presented in Figure 1.18. Apparently, the variance of $\bar{Y}$ does in fact
decrease. The graph of the density for $n = 1000$ is already very narrow and high, indicating a small
variance. Of course, we cannot actually increase $n$ to infinity without crashing our computer, but
it appears plausible that the density will eventually collapse into one vertical line corresponding to
$\text{Var}(\bar{Y}) \to 0$ as $n \to \infty$.
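
Instead of editing n by hand and rerunning the script, the comparison can be sketched with one additional loop over the sample sizes. This is an illustrative variant under our own assumptions (NumPy's default_rng and matrix-wise sampling), not the code that generated Figure 1.18; it compares the simulated variance of the sample average with the theoretical value $\sigma^2/n = 4/n$.

```python
import numpy as np

# assumption: NumPy's Generator API; draws differ from the book's scripts
rng = np.random.default_rng(123456)
r = 10000
sample_sizes = [10, 50, 100, 1000]

# simulated variance of the sample average for each sample size:
sim_var = {}
for n in sample_sizes:
    # r samples of size n as rows; row means are the r estimates
    ybar = rng.normal(10, 2, size=(r, n)).mean(axis=1)
    sim_var[n] = np.var(ybar, ddof=1)
    print(f'n = {n}: simulated variance = {sim_var[n]:.5f}, '
          f'theoretical variance = {4 / n:.5f}')
```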

In our example for the simulations, the random variable $Y$ was normally distributed, therefore
the sample average $\bar{Y}$ was also normal for any sample size. This can also be confirmed in Figure
1.18 where the respective normal densities were added to the graphs as dashed lines. The central
limit theorem (CLT) claims that as $n \to \infty$, the sample mean $\bar{Y}$ of a random sample will eventually
always be normally distributed, no matter what the distribution of $Y$ is (unless it is very weird with
an infinite variance). This is called convergence in distribution.

Let's check this with a very non-normal distribution, the $\chi^2$ distribution with one degree of
freedom. Its density is depicted in Figure 1.19.²⁵ It looks very different from our familiar
bell-shaped normal density. The only line we have to change in the simulation code in Script
1.52 (Simulation-Repeated.py) is sample = stats.norm.rvs(10, 2, size=n) which we
have to replace with sample = stats.chi2.rvs(1, size=n) according to Table 1.6. Figure
1.20 shows the simulated densities for different sample sizes and compares them to the normal
distribution with the same mean $\mu = 1$ and standard deviation $\sigma/\sqrt{n} = \sqrt{2/n}$. Note that the scales of the
axes now differ between the sub-figures in order to provide a better impression of the shape of the
densities. The effect of a decreasing variance works here in exactly the same way as with the normal
population.


²⁵ A motivated reader will already have figured out that this graph was generated by chi2.pdf(x, df) from the scipy
module.


Figure 1.18. Density of $\bar{Y}$ with Different Sample Sizes

Not surprisingly, the distribution of $\bar{Y}$ is very different from a normal one in small samples like
$n = 2$. With increasing sample size, the CLT works its magic and the distribution gets closer to the
normal bell-shape. For $n = 10000$, the densities hardly differ at all so it's easy to imagine that they
will eventually be the same as $n \to \infty$.
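
One way to put a number on "getting closer to the bell-shape" is the skewness of the simulated sample averages, which is zero for a normal distribution. The sketch below is our own illustration, not part of the book's scripts; it assumes NumPy's default_rng, uses fewer replications (r = 5000) and a largest sample size of 1000 to keep memory use modest, and computes the skewness by hand.

```python
import numpy as np

# assumption: NumPy's Generator API; r and the sample sizes are illustrative
rng = np.random.default_rng(123456)
r = 5000
sample_sizes = [2, 10, 100, 1000]

skewness = {}
for n in sample_sizes:
    # r samples of size n from the chi-squared distribution with 1 d.f.
    ybar = rng.chisquare(df=1, size=(r, n)).mean(axis=1)
    # sample skewness: third central moment over cubed standard deviation
    centered = ybar - ybar.mean()
    skewness[n] = np.mean(centered ** 3) / np.std(ybar) ** 3
    print(f'n = {n}: skewness of ybar = {skewness[n]:.3f}')
```

The skewness shrinks toward zero as n grows, in line with the visual impression from Figure 1.20.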


1.9.3. Simulation of Confidence Intervals and t Tests


In addition to repeatedly estimating population parameters, we can also calculate confidence
intervals and conduct tests on the simulated samples. Here, we present a somewhat advanced simulation
routine. The payoff of going through this material is that it might substantially improve our
understanding of the workings of statistical inference.


Figure 1.19. Density of the $\chi^2$ Distribution with 1 d.f.

Figure 1.20. Density of $\bar{Y}$ with Different Sample Sizes:

We start from the same example as in Section 1.9.1: In the population, $Y \sim \text{Normal}(10, 4)$. We
draw 10000 samples of size $n = 100$ from this population. For each of the samples we calculate

* The 95% confidence interval and store the limits in CIlower and CIupper.
* The p value for the two-sided test of the correct null hypothesis $H_0: \mu = 10$ $\Rightarrow$ array pvalue1
* The p value for the two-sided test of the incorrect null hypothesis $H_0: \mu = 9.5$ $\Rightarrow$ array pvalue2

Finally, we calculate the arrays reject1 and reject2 with logical items that are True if we
reject the respective null hypothesis at $\alpha = 5\%$, i.e. if pvalue1 or pvalue2 are smaller than 0.05,
respectively. Script 1.55 (Simulation-Inference.py) shows the Python code for these simulations
and a frequency table for the results reject1 and reject2.

If theory and the implementation in Python are accurate, the probability to reject a correct null
hypothesis (i.e. to make a Type I error) should be equal to the chosen significance level $\alpha$. In our
simulation, we reject the correct hypothesis in 504 of the 10000 samples, which amounts to 5.04%.

The probability to reject a false hypothesis is called the power of a test. It depends on many things
like the sample size and "how bad" the error of $H_0$ is, i.e. how far away $\mu_0$ is from the true $\mu$. Theory
just tells us that the power is larger than $\alpha$. In our simulation, the wrong null $H_0: \mu = 9.5$ is rejected
in 69.9% of the samples. The reader is strongly encouraged to tinker with the simulation code to
verify the theoretical results that this power increases if $\mu_0$ moves away from 10 and if the sample
size $n$ increases.
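
As a starting point for such tinkering, the following sketch wraps the simulation in a small helper function. The function name rejection_rate and its arguments are hypothetical, and we assume NumPy's default_rng together with the axis argument of scipy's ttest_1samp to test all samples at once; the number of replications is reduced to keep the run time short.

```python
import numpy as np
import scipy.stats as stats


def rejection_rate(mu0, n, r=2000, alpha=0.05, seed=123456):
    """Share of r samples from Normal(10, 4) in which H0: mu = mu0 is rejected.

    Hypothetical helper for illustration, not part of the book's scripts.
    """
    rng = np.random.default_rng(seed)
    # one simulated sample of size n per row:
    samples = rng.normal(10, 2, size=(r, n))
    # two-sided t test for each row:
    pvalues = stats.ttest_1samp(samples, popmean=mu0, axis=1).pvalue
    return np.mean(pvalues < alpha)


# correct null hypothesis: the rejection rate should be close to alpha
size = rejection_rate(mu0=10, n=100)
# false null hypotheses: the power should grow with |mu0 - 10| and with n
power_near = rejection_rate(mu0=9.5, n=100)
power_far = rejection_rate(mu0=9.0, n=100)
power_large_n = rejection_rate(mu0=9.5, n=400)

print(f'size: {size}')
print(f'power_near: {power_near}')
print(f'power_far: {power_far}')
print(f'power_large_n: {power_large_n}')
```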

Figure 1.21 graphically presents the 95% CI for the first 100 simulated samples.²⁶ Each horizontal
line represents one CI. In these first 100 samples, the true null was rejected in 4 cases. This fact
means that for those four samples the CI does not cover $\mu = 10$, see Wooldridge (2019, Appendix
C.6) on the relationship between CI and tests. These four cases are drawn in black in the left part of
the figure, whereas the others are gray.

The t-test rejects the false null hypothesis $H_0: \mu = 9.5$ in 72 of the first 100 samples. Their CIs do
not cover 9.5 and are drawn in black in the right part of Figure 1.21.
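
The relationship between confidence intervals and tests can also be checked numerically. The sketch below is our own illustration (assuming NumPy's default_rng and matrix-wise sampling, so the draws differ from Script 1.55): for every simulated sample, the 95% CI should cover 10 exactly when the two-sided t test of $H_0: \mu = 10$ does not reject at the 5% level.

```python
import numpy as np
import scipy.stats as stats

# assumption: NumPy's Generator API; draws differ from Script 1.55
rng = np.random.default_rng(123456)
r, n = 10000, 100
samples = rng.normal(10, 2, size=(r, n))

# 95% confidence interval for each sample (row):
sample_mean = samples.mean(axis=1)
se = samples.std(axis=1, ddof=1) / np.sqrt(n)
cv = stats.t.ppf(0.975, df=n - 1)
ci_lower = sample_mean - cv * se
ci_upper = sample_mean + cv * se

# does the CI cover the true mean?
covers = (ci_lower <= 10) & (10 <= ci_upper)

# does the t test reject the correct null hypothesis mu=10?
pvalues = stats.ttest_1samp(samples, popmean=10, axis=1).pvalue
rejects = pvalues < 0.05

coverage = covers.mean()
agree = bool(np.all(covers == ~rejects))
print(f'coverage: {coverage}')
print(f'CI covers 10 exactly when the test does not reject: {agree}')
```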


²⁶ For the sake of completeness, the code for generating these graphs is shown in Appendix IV, Script 1.54
(Simulation-Inference-Figure.py), but most readers will probably not find it important to look at it at this point.
Script 1.55: Simulation-Inference.py
import numpy as np
import scipy.stats as stats

# set the random seed:
np.random.seed(123456)

# set sample size and MC simulations:
r = 10000
n = 100

# initialize arrays to later store results:
CIlower = np.empty(r)
CIupper = np.empty(r)
pvalue1 = np.empty(r)
pvalue2 = np.empty(r)

# repeat r times:
for j in range(r):
    # draw a sample:
    sample = stats.norm.rvs(10, 2, size=n)
    sample_mean = np.mean(sample)
    sample_sd = np.std(sample, ddof=1)

    # test the (correct) null hypothesis mu=10:
    testres1 = stats.ttest_1samp(sample, popmean=10)
    pvalue1[j] = testres1.pvalue
    cv = stats.t.ppf(0.975, df=n - 1)
    CIlower[j] = sample_mean - cv * sample_sd / np.sqrt(n)
    CIupper[j] = sample_mean + cv * sample_sd / np.sqrt(n)

    # test the (incorrect) null hypothesis mu=9.5 & store the p value:
    testres2 = stats.ttest_1samp(sample, popmean=9.5)
    pvalue2[j] = testres2.pvalue

# test results as logical value:
reject1 = pvalue1 <= 0.05
count1_true = np.count_nonzero(reject1)  # counts true
count1_false = r - count1_true
print(f'count1_true: {count1_true}\n')
print(f'count1_false: {count1_false}\n')

reject2 = pvalue2 <= 0.05
count2_true = np.count_nonzero(reject2)
count2_false = r - count2_true
print(f'count2_true: {count2_true}\n')
print(f'count2_false: {count2_false}\n')

Output of Script 1.55: Simulation-Inference.py
count1_true: 504

count1_false: 9496

count2_true: 6990

count2_false: 3010
Figure 1.21. Simulation Results: First 100 Confidence Intervals


Part I.


Regression Analysis with 
Cross-Sectional Data 


2. The Simple Regression Model 


2.1. Simple OLS Regression 


We are concerned with estimating the population parameters $\beta_0$ and $\beta_1$ of the simple linear
regression model

$y = \beta_0 + \beta_1 x + u$ (2.1)

from a random sample of $y$ and $x$. According to Wooldridge (2019, Section 2.2), the ordinary least
squares (OLS) estimators are


$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$ (2.2)
$\hat\beta_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)}$. (2.3)


Based on these estimated parameters, the OLS regression line is

$\hat{y} = \hat\beta_0 + \hat\beta_1 x$. (2.4)


For a given sample, we just need to calculate the four statistics $\bar{y}$, $\bar{x}$, $\text{Cov}(x, y)$, and $\text{Var}(x)$ and plug
them into these equations. We already know how to make these calculations in Python, see Section
1.5. Let's do it!


Wooldridge, Example 2.3: CEO Salary and Return on Equity 


We are using the data set CEOSAL1 we already analyzed in Section 1.5. We consider the simple
regression model

$salary = \beta_0 + \beta_1 roe + u$


where salary is the salary of a CEO in thousand dollars and roe is the return on equity in percent.
In Script 2.1 (Example-2-3.py), we first load the modules and the data set. We also calculate the four
statistics we need for Equations 2.2 and 2.3 so we can reproduce the OLS formulas by hand. Finally, the
parameter estimates are calculated.
So the OLS regression line is

$\widehat{salary} = 963.1913 + 18.50119 \cdot roe$.
Script 2.1: Example-2-3.py
import wooldridge as woo
import numpy as np

ceosal1 = woo.dataWoo('ceosal1')
x = ceosal1['roe']
y = ceosal1['salary']

# ingredients to the OLS formulas:
cov_xy = np.cov(x, y)[1, 0]  # access 2. row and 1. column of covariance matrix
var_x = np.var(x, ddof=1)
x_bar = np.mean(x)
y_bar = np.mean(y)

# manual calculation of OLS coefficients:
b1 = cov_xy / var_x
b0 = y_bar - b1 * x_bar
print(f'b1: {b1}\n')
print(f'b0: {b0}\n')


Output of Script 2.1: Example-2-3.py
b1: 18.501186345214926

b0: 963.1913364725579


While calculating OLS coefficients using this pedestrian approach is straightforward, there is a 
more convenient way to do it. Given the importance of OLS regression, it is not surprising that 
many Python modules have a specialized command to do the calculations automatically. In the 
following chapters, we will often use the module statsmodels to apply linear regression and other 
econometric methods. More information about the module is provided by Seabold and Perktold 
(2010). When working with statsmodels, the first line of code often is: 


import statsmodels.formula.api as smf


If the data frame sample contains the values of the dependent variable in column y and those of 
the regressor in the column x, we can calculate the OLS coefficients as 


reg = smf.ols(formula='y ~ x', data=sample)
results = reg.fit()


The first argument y ~ x is called a formula. Essentially, it means that we want to model a 
left-hand-side variable y to be explained by a right-hand-side variable x in a linear fashion. We will 
discuss more general model formulae in Section 6.1. In the second line of code, the actual calculation 
of OLS coefficients and many other results are performed by calling the method fit. 

Finally, all kinds of results are assigned to the variable results. The name could of course be
anything, for example yummy_chocolate_chip_cookies, but choosing telling variable names
makes our life easier. As already mentioned, the referenced object does not only include the OLS
coefficients, but also information on the data source and much more we will get to know and use
later on.
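
To see these two lines at work before turning to real data, here is a minimal sketch with simulated data. The data frame sample, its true coefficients (1 and 0.5) and the use of NumPy's default_rng are our own assumptions for illustration; the sketch compares the statsmodels result with the manual formulas from Equations 2.2 and 2.3.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data: y = 1 + 0.5 * x + u with standard normal x and u
rng = np.random.default_rng(123456)
x = rng.normal(0, 1, size=200)
y = 1 + 0.5 * x + rng.normal(0, 1, size=200)
sample = pd.DataFrame({'y': y, 'x': x})

# OLS regression with the formula interface:
reg = smf.ols(formula='y ~ x', data=sample)
results = reg.fit()
print(f'results.params: \n{results.params}\n')

# manual calculation using Equations 2.2 and 2.3:
b1_manual = np.cov(sample['x'], sample['y'])[1, 0] / np.var(sample['x'], ddof=1)
b0_manual = np.mean(sample['y']) - b1_manual * np.mean(sample['x'])
print(f'b0_manual: {b0_manual}')
print(f'b1_manual: {b1_manual}')
```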


The module statsmodels is part of the Anaconda distribution. 
Wooldridge, Example 2.3: CEO Salary and Return on Equity (cont'ed)


In Script 2.2 (Example-2-3-2.py), we repeat the analysis we have already done manually. Besides
the import of the data, there are only a few lines of code. The output shows how to access both
estimated parameters with results.params: $\hat\beta_0$ is labeled Intercept and $\hat\beta_1$ is labeled with the name
of the explanatory variable roe. The values are the same we already calculated except for different
rounding in the output.


Script 2.2: Example-2-3-2.py
import wooldridge as woo
import statsmodels.formula.api as smf

ceosal1 = woo.dataWoo('ceosal1')

reg = smf.ols(formula='salary ~ roe', data=ceosal1)
results = reg.fit()
b = results.params
print(f'b: \n{b}\n')


Output of Script 2.2: Example-2-3-2.py
b:
Intercept    963.191336
roe           18.501186
dtype: float64


From now on, we will rely on the built-in routine in statsmodels instead of doing the calcula- 
tions manually. It is not only more convenient for calculating the coefficients, but also for further 
analyses as we will see soon. 

Given the results from a regression, plotting the regression line is straightforward. As we have
already seen in Section 1.4.3, the command plot can add points to a graph. In this case, we simply
supply the regressor roe and the predicted values (available under results.fittedvalues) and
connect them by a line.


Wooldridge, Example 2.3: CEO Salary and Return on Equity (cont’ed) 


Script 2.3 (Example-2-3-3.py) demonstrates how to store the regression results in a variable results
and then use its included fitted values as an argument to plot to add the regression line to the scatter
plot. It generates Figure 2.1.
Script 2.3: Example-2-3-3.py
import wooldridge as woo
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

ceosal1 = woo.dataWoo('ceosal1')

# OLS regression:
reg = smf.ols(formula='salary ~ roe', data=ceosal1)
results = reg.fit()

# scatter plot and fitted values:
plt.plot('roe', 'salary', data=ceosal1, color='grey', marker='o', linestyle='')
plt.plot(ceosal1['roe'], results.fittedvalues, color='black', linestyle='-')
plt.ylabel('salary')
plt.xlabel('roe')
plt.savefig('PyGraphs/Example-2-3-3.pdf')


Figure 2.1. OLS Regression Line for Example 2-3 


Wooldridge, Example 2.4: Wage and Education


We are using the data set WAGE1. We are interested in studying the relation between education and
wage, and our regression model is


$wage = \beta_0 + \beta_1 education + u$.

In Script 2.4 (Example-2-4.py), we analyze the data and find that the OLS regression line is

$\widehat{wage} = -0.90 + 0.54 \cdot education$.


One additional year of education is associated with an increase of the typical wage by about 54 cents
an hour.
Script 2.4: Example-2-4.py
import wooldridge as woo
import statsmodels.formula.api as smf

wage1 = woo.dataWoo('wage1')

reg = smf.ols(formula='wage ~ educ', data=wage1)
results = reg.fit()
b = results.params
print(f'b: \n{b}\n')


Output of Script 2.4: Example-2-4.py 


b: 
Intercept -0.904852 
educ 0.541359 


dtype: float64 


Wooldridge, Example 2.5: Voting Outcomes and Campaign Expenditures 


The data set VOTE1 contains information on campaign expenditures (shareA = share of campaign
spending in %) and election outcomes (voteA = share of vote in %). The regression model


$voteA = \beta_0 + \beta_1 shareA + u$

is estimated in Script 2.5 (Example-2-5.py). The OLS regression line turns out to be

$\widehat{voteA} = 26.81 + 0.464 \cdot shareA$.


The scatter plot with the regression line generated in the code is shown in Figure 2.2.


Script 2.5: Example-2-5.py
import wooldridge as woo
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

vote1 = woo.dataWoo('vote1')

# OLS regression:
reg = smf.ols(formula='voteA ~ shareA', data=vote1)
results = reg.fit()
b = results.params
print(f'b: \n{b}\n')

# scatter plot and fitted values:
plt.plot('shareA', 'voteA', data=vote1, color='grey', marker='o', linestyle='')
plt.plot(vote1['shareA'], results.fittedvalues, color='black', linestyle='-')
plt.ylabel('voteA')
plt.xlabel('shareA')
plt.savefig('PyGraphs/Example-2-5.pdf')


Output of Script 2.5: Example-2-5.py 
b: 
Intercept 26.812214 
shareA 0.463827 
dtype: float64 
Figure 2.2. OLS Regression Line for Example 2-5


2.2. Coefficients, Fitted Values, and Residuals


The object returned by the method fit contains all relevant information on the regression. Since this
information is distributed across multiple attributes of the returned object, we can access
them with the syntax resultobject.attribute_name. After defining the regression results
object results, the estimated coefficients are available as results.params, as in Script 2.2.


The coefficient object has names attached to its elements. The name of the intercept parameter $\hat\beta_0$
is Intercept and the name of the slope parameter $\hat\beta_1$ is the variable name of the regressor x. In
this way, we can access the parameters separately by using either the position (0 or 1) or the name as
an index to the coefficients object. For example, in Script 2.2 (Example-2-3-2.py) you can access
intercept and slope parameter by


b[0]  # intercept
b['roe']  # slope parameter
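
The coefficient object is a pandas Series, so both kinds of access can be tried out without running a regression. The sketch below rebuilds a hypothetical Series b by hand from the values printed in the output of Script 2.2. It uses .iloc for access by position, which is the explicit positional form in pandas; to our knowledge, recent pandas versions may issue a deprecation warning for a plain integer key such as b[0] on a Series with named elements.

```python
import pandas as pd

# hypothetical stand-in for results.params with the values from Script 2.2
b = pd.Series({'Intercept': 963.191336, 'roe': 18.501186})

# access by name:
intercept_by_name = b['Intercept']
slope_by_name = b['roe']

# access by position (explicit positional indexing):
intercept_by_pos = b.iloc[0]
slope_by_pos = b.iloc[1]

print(f'intercept: {intercept_by_name}, {intercept_by_pos}')
print(f'slope: {slope_by_name}, {slope_by_pos}')
```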


Given these parameter estimates, calculating the predicted values $\hat{y}_i$ and residuals $\hat{u}_i$ for each
observation $i = 1, \ldots, n$ is easy:


$\hat{y}_i = \hat\beta_0 + \hat\beta_1 \cdot x_i$ (2.5)
$\hat{u}_i = y_i - \hat{y}_i$ (2.6)

If the values of the dependent and independent variables are stored in a data frame sample as
y and x, respectively, we can estimate the model and do the calculations of these equations for all
observations jointly using the code


reg = smf.ols(formula='y ~ x', data=sample)
results = reg.fit()
b = results.params
y_hat = b[0] + b[1] * sample['x']
u_hat = sample['y'] - y_hat


We can also use a more black-box approach which will give exactly the same results using the
precalculated variables fittedvalues and resid on the regression results object:


reg = smf.ols(formula='y ~ x', data=sample)
results = reg.fit()
y_hat = results.fittedvalues
u_hat = results.resid


Wooldridge, Example 2.6: CEO Salary and Return on Equity 


We extend the regression example on the return on equity of a firm and the salary of its CEO in Script 
2.6 (Example-2-6.py). After the OLS regression, we calculate fitted values and residuals. A table similar 
to Wooldridge (2019, Table 2.2) is generated displaying the values for the first 15 observations. 


Script 2.6: Example-2-6.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

ceosal1 = woo.dataWoo('ceosal1')

# OLS regression:
reg = smf.ols(formula='salary ~ roe', data=ceosal1)
results = reg.fit()

# obtain predicted values and residuals:
salary_hat = results.fittedvalues
u_hat = results.resid

# Wooldridge, Table 2.2:
table = pd.DataFrame({'roe': ceosal1['roe'],
                      'salary': ceosal1['salary'],
                      'salary_hat': salary_hat,
                      'u_hat': u_hat})
print(f'table.head(15): \n{table.head(15)}\n')


Output of Script 2.6: Example-2-6.py
table.head(15):
          roe  salary   salary_hat       u_hat
0   14.100000    1095  1224.058071 -129.058071
1   10.900000    1001  1164.854261 -163.854261
2   23.500000    1122  1397.969216 -275.969216
3    5.900000     578  1072.348338 -494.348338
4   13.800000    1368  1218.507712  149.492288
5   20.000000    1145  1333.215063 -188.215063
6   16.400000    1078  1266.610785 -188.610785
7   16.299999    1094  1264.760660 -170.760660
8   10.500000    1237  1157.453793   79.546207
9   26.299999     833  1449.772523 -616.772523
10  25.900000     567  1442.372056 -875.372056
11  26.799999     933  1459.023116 -526.023116
12  14.800000    1339  1237.008898  101.991102
13  22.299999     937  1375.767778 -438.767778
14  56.299999    2011  2004.808114    6.191886


Wooldridge (2019, Section 2.3) presents and discusses three properties of OLS statistics which we
will confirm for an example.


$\sum_{i=1}^n \hat{u}_i = 0 \;\Rightarrow\; \bar{\hat{u}} = 0$ (2.7)
$\sum_{i=1}^n x_i \hat{u}_i = 0 \;\Rightarrow\; \text{Cov}(x_i, \hat{u}_i) = 0$ (2.8)
$\bar{y} = \hat\beta_0 + \hat\beta_1 \cdot \bar{x}$ (2.9)
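
As a reminder of where these properties come from, the following short derivation sketches the standard argument from the first-order conditions of the least squares problem. The symbols $b_0$ and $b_1$ for generic candidate values are our own notation.

```latex
% OLS minimizes the sum of squared residuals over candidate values b_0, b_1:
\min_{b_0, b_1} \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2

% First-order condition with respect to b_0, evaluated at the OLS estimates:
-2 \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
\quad\Rightarrow\quad \sum_{i=1}^n \hat{u}_i = 0

% First-order condition with respect to b_1:
-2 \sum_{i=1}^n x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
\quad\Rightarrow\quad \sum_{i=1}^n x_i \hat{u}_i = 0

% Both conditions together give a zero sample covariance:
\text{Cov}(x_i, \hat{u}_i)
  = \frac{1}{n-1} \left( \sum_{i=1}^n x_i \hat{u}_i - n \, \bar{x} \, \bar{\hat{u}} \right) = 0

% Dividing the first condition by n gives the third property:
\bar{y} - \hat\beta_0 - \hat\beta_1 \bar{x} = 0
\quad\Rightarrow\quad \bar{y} = \hat\beta_0 + \hat\beta_1 \bar{x}
```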


Wooldridge, Example 2.7: Wage and Education 


We already know the regression results when we regress wage on education from Example 2.4. In
Script 2.7 (Example-2-7.py), we calculate fitted values and residuals to confirm the three properties
from Equations 2.7 through 2.9. Note that Python does all calculations in "double precision", implying
that it is accurate for at least 15 significant digits. The output that checks the first property shows
that the average residual is -7.564713e-15 which in scientific notation means $-7.564713 \cdot 10^{-15} =
-0.000000000000007564713$. The reason it is not exactly equal to 0 is a rounding error in the 16th digit.
The same holds for the second property: The covariance between the regressor and the residual is zero
except for minimal rounding error. Note that running Script 2.7 (Example-2-7.py) will give you the same
accurate digits, but the digits with rounding error will differ. The third property is also confirmed: If we
plug the average value of the regressor into the regression line formula, we get the average value of
the dependent variable.


Script 2.7: Example-2-7.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

wage1 = woo.dataWoo('wage1')
reg = smf.ols(formula='wage ~ educ', data=wage1)
results = reg.fit()

# obtain coefficients, predicted values and residuals:
b = results.params
wage_hat = results.fittedvalues
u_hat = results.resid

# confirm property (1):
u_hat_mean = np.mean(u_hat)
print(f'u_hat_mean: {u_hat_mean}\n')

# confirm property (2):
educ_u_cov = np.cov(wage1['educ'], u_hat)[1, 0]
print(f'educ_u_cov: {educ_u_cov}\n')

# confirm property (3):
educ_mean = np.mean(wage1['educ'])
wage_pred = b[0] + b[1] * educ_mean
print(f'wage_pred: {wage_pred}\n')

wage_mean = np.mean(wage1['wage'])
print(f'wage_mean: {wage_mean}\n')


Output of Script 2.7: Example-2-7.py
u_hat_mean: -7.564713536609432e-15

educ_u_cov: -2.3211062701496606e-15

wage_pred: 5.896102674787043

wage_mean: 5.896102674787035


2.3. Goodness of Fit 


The total sum of squares (SST), explained sum of squares (SSE) and residual sum of squares (SSR)
can be written as

$SST = \sum_{i=1}^n (y_i - \bar{y})^2 = (n - 1) \cdot \text{Var}(y)$ (2.10)
$SSE = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = (n - 1) \cdot \text{Var}(\hat{y})$ (2.11)
$SSR = \sum_{i=1}^n (\hat{u}_i - 0)^2 = (n - 1) \cdot \text{Var}(\hat{u})$ (2.12)


where $\text{Var}(x)$ is the sample variance $\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$.
Wooldridge (2019, Equation 2.38) defines the coefficient of determination in terms of these sums of squares.
Because $(n - 1)$ cancels out, it can be equivalently written as


$R^2 = \frac{\text{Var}(\hat{y})}{\text{Var}(y)} = 1 - \frac{\text{Var}(\hat{u})}{\text{Var}(y)}$ (2.13)

Wooldridge, Example 2.8: CEO Salary and Return on Equity 


In the regression already studied in Example 2.6, the coefficient of determination is 0.0132. This is
calculated in the two ways of Equation 2.13 in Script 2.8 (Example-2-8.py). In addition, it is calculated as the
squared correlation coefficient of $y$ and $\hat{y}$. Not surprisingly, all versions of these calculations produce the
same result (they are not exactly equal to each other because of the rounding error in the 16th digit).
Script 2.8: Example-2-8.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

ceosal1 = woo.dataWoo('ceosal1')

# OLS regression:
reg = smf.ols(formula='salary ~ roe', data=ceosal1)
results = reg.fit()

# calculate predicted values & residuals:
sal_hat = results.fittedvalues
u_hat = results.resid

# calculate R^2 in three different ways:
sal = ceosal1['salary']
R2_a = np.var(sal_hat, ddof=1) / np.var(sal, ddof=1)
R2_b = 1 - np.var(u_hat, ddof=1) / np.var(sal, ddof=1)
R2_c = np.corrcoef(sal, sal_hat)[1, 0] ** 2

print(f'R2_a: {R2_a}\n')
print(f'R2_b: {R2_b}\n')
print(f'R2_c: {R2_c}\n')


Output of Script 2.8: Example-2-8.py
R2_a: 0.013188624081034115

R2_b: 0.01318862408103405

R2_c: 0.013188624081034089


Many interesting results for a regression can be displayed by calling the method summary.
You call this method on the object returned by the method fit as demonstrated in Script 2.9
(Example-2-9.py). The output will display

* A block of general information about the regression model. It also contains other information
about the estimation, of which only $R^2$ is of interest to us so far. It is reported as R-squared.
* A coefficient table. So far, we only discussed the OLS coefficients shown in the first column.
The next columns will be introduced below.
* A block of diagnostics regarding the residuals. We will discuss some of them later.

When we are only interested in the coefficients and their significance, we will often switch to a
more compact presentation of results. This is demonstrated with the object table in Script 2.9
(Example-2-9.py).


Wooldridge, Example 2.9: Voting Outcomes and Campaign Expenditures 


We already know the OLS coefficients to be $\hat\beta_0 = 26.8122$ and $\hat\beta_1 = 0.4638$ in the voting example (Script
2.5 (Example-2-5.py)). These values are again found in the output of Script 2.9 (Example-2-9.py).
The coefficient of determination is reported as R-squared to be $R^2 = 0.856$. Reassuringly, we get the
same numbers as with the pedestrian calculations.
Script 2.9: Example-2-9.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

vote1 = woo.dataWoo('vote1')

# OLS regression:
reg = smf.ols(formula='voteA ~ shareA', data=vote1)
results = reg.fit()

# print results using summary:
print(f'results.summary(): \n{results.summary()}\n')

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Output of Script 2.9: Example-2-9.py
results.summary():
                            OLS Regression Results
==============================================================================
Dep. Variable:                  voteA   R-squared:                       0.856
Model:                            OLS   Adj. R-squared:                  0.855
Method:                 Least Squares   F-statistic:                     1018.
Date:                Thu, 23 Apr 2020   Prob (F-statistic):           6.63e-74
Time:                        08:20:09   Log-Likelihood:                -565.20
No. Observations:                 173   AIC:                             1134.
Df Residuals:                     171   BIC:                             1141.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     26.8122      0.887     30.221      0.000      25.061      28.564
shareA         0.4638      0.015     31.901      0.000       0.435       0.493
==============================================================================
Omnibus:                       20.747   Durbin-Watson:                   1.826
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               44.613
Skew:                           0.525   Prob(JB):                     2.05e-10
Kurtosis:                       5.255   Cond. No.                         112.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

table:
                 b      se        t  pval
Intercept  26.8122  0.8872  30.2207   0.0
shareA      0.4638  0.0145  31.9008   0.0




2.4. Nonlinearities 


For the estimation of logarithmic or semi-logarithmic models, the respective formula can be directly
entered into the specification of smf.ols(...) as demonstrated in Examples 2.10 and 2.11. For the
interpretation as percentage effects and elasticities, see Wooldridge (2019, Section 2.4).


Wooldridge, Example 2.10: Wage and Education


Compared to Example 2.7, we simply change the command for the estimation to account for a
logarithmic specification as shown in Script 2.10 (Example-2-10.py). The semi-logarithmic specification
implies that wages are higher by about 8.3% for individuals with an additional year of education.


Script 2.10: Example-2-10.py 
import numpy as np 
import wooldridge as woo 
import statsmodels.formula.api as smf 


wage1 = woo.dataWoo('wage1')

# estimate log-level model:
reg = smf.ols(formula='np.log(wage) ~ educ', data=wage1)
results = reg.fit()
b = results.params
print(f'b: \n{b}\n')


Output of Script 2.10: Example-2-10.py


b: 
Intercept 0.583773 
educ 0.082744 
dtype: float64 
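
The "about 8.3%" reading multiplies the slope by 100, which is an approximation for small coefficients. As a rough check, the following sketch compares it with the exact percentage change implied by a log-level model, 100 * (exp(b) - 1), using the slope printed above. The variable names are illustrative and not part of the original script.

```python
import math

# slope estimate copied from the output of Script 2.10
b_educ = 0.082744

# approximate percentage effect of one more year of education
approx_pct = 100 * b_educ

# exact percentage effect implied by the log-level specification
exact_pct = 100 * (math.exp(b_educ) - 1)

print(f'approx_pct: {approx_pct:.2f}')
print(f'exact_pct: {exact_pct:.2f}')
```

The two numbers differ only slightly here because the coefficient is small; the gap widens for larger coefficients.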


Wooldridge, Example 2.11: CEO Salary and Firm Sales 


We study the relationship between the sales of a firm and the salary of its CEO using a log-log specifi- 
cation. The results are shown in Script 2.11 (Example-2-11.py). If the sales increase by 1%, the salary of 
the CEO tends to increase by 0.257%. 


Script 2.11: Example-2-11.py 


import numpy as np 
import wooldridge as woo 
import statsmodels.formula.api as smf 


ceosal1 = woo.dataWoo('ceosal1')

# estimate log-log model:
reg = smf.ols(formula='np.log(salary) ~ np.log(sales)', data=ceosal1)
results = reg.fit()
b = results.params
print(f'b: \n{b}\n')

Output of Script 2.11: Example-2-11.py

b:
Intercept        4.821996
np.log(sales)    0.256672
dtype: float64
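
The elasticity interpretation is exact only for very small changes. For a larger change, the log-log model implies that salary scales by the sales factor raised to the power of the slope. The sketch below works this out for a hypothetical 10% increase in sales, a value chosen only for illustration.

```python
# elasticity estimate copied from the output of Script 2.11
b_sales = 0.256672

# hypothetical increase in sales of 10% (an assumption for illustration)
sales_factor = 1.10

# approximation: elasticity times the percentage change in sales
approx_pct = b_sales * 10

# exact change implied by the log-log model
exact_pct = 100 * (sales_factor ** b_sales - 1)

print(f'approx_pct: {approx_pct:.3f}')
print(f'exact_pct: {exact_pct:.3f}')
```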


2.5. Regression through the Origin and Regression on a 
Constant 


Wooldridge (2019, Section 2.6) discusses models without an intercept. This implies that the re- 
gression line is forced to go through the origin. In Python, we can suppress the constant which is 
otherwise implicitly added to a formula by specifying 


smf.ols('y ~ 0 + x', data=sample)


instead of smf.ols('y ~ x', data=sample). The result is a model which only has a slope
parameter. 

Another topic discussed in this section is a linear regression model without a slope parameter, i.e. 
with a constant only. In this case, the estimated constant will be the sample average of the dependent 
variable. This can be implemented in Python using the code 


smf.ols('y ~ 1', data=sample)


Both special kinds of regressions are implemented in Script 2.12 (SLR-Origin-Const.py) for the
example of the CEO salary and ROE we already analyzed in Example 2.8 and others. The resulting
regression lines are plotted in Figure 2.3, which was generated using the last lines of code in
the script.
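
Both special cases have simple closed forms, which can be checked without any regression routine. The following sketch uses simulated data (not the CEO salary data) and numpy only; the names b_origin and b_const are illustrative. It assumes the textbook formulas: the slope through the origin is the sum of x*y divided by the sum of x squared, and the constant-only estimate is the sample mean of y.

```python
import numpy as np

# simulated data (purely illustrative, not the CEO salary data)
rng = np.random.default_rng(42)
x = rng.normal(10, 3, size=200)
y = 5 + 2 * x + rng.normal(0, 1, size=200)

# regression through the origin: slope = sum(x*y) / sum(x^2)
b_origin = np.sum(x * y) / np.sum(x ** 2)

# regression on a constant only: the estimate is the sample mean of y
b_const = np.mean(y)

# cross-check both against a generic least-squares solver
b_origin_ls = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)[0][0]
b_const_ls = np.linalg.lstsq(np.ones((len(y), 1)), y, rcond=None)[0][0]

print(f'b_origin: {b_origin}, lstsq: {b_origin_ls}')
print(f'b_const: {b_const}, lstsq: {b_const_ls}')
```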


Script 2.12: SLR-Origin-Const.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

ceosal1 = woo.dataWoo('ceosal1')

# usual OLS regression:
reg1 = smf.ols(formula='salary ~ roe', data=ceosal1)
results1 = reg1.fit()
b_1 = results1.params
print(f'b_1: \n{b_1}\n')

# regression without intercept (through origin):
reg2 = smf.ols(formula='salary ~ 0 + roe', data=ceosal1)
results2 = reg2.fit()
b_2 = results2.params
print(f'b_2: \n{b_2}\n')

# regression without slope (on a constant):
reg3 = smf.ols(formula='salary ~ 1', data=ceosal1)
results3 = reg3.fit()
b_3 = results3.params
print(f'b_3: \n{b_3}\n')

# average y:
sal_mean = np.mean(ceosal1['salary'])
print(f'sal_mean: {sal_mean}\n')

# scatter plot and fitted values:
plt.plot('roe', 'salary', data=ceosal1, color='grey', marker='o',
         linestyle='', label='')
plt.plot(ceosal1['roe'], results1.fittedvalues, color='black',
         linestyle='-', label='full')
plt.plot(ceosal1['roe'], results2.fittedvalues, color='black',
         linestyle=':', label='through origin')
plt.plot(ceosal1['roe'], results3.fittedvalues, color='black',
         linestyle='-.', label='const only')
plt.ylabel('salary')
plt.xlabel('roe')
plt.legend()
plt.savefig('PyGraphs/SLR-Origin-Const.pdf')


Output of Script 2.12: SLR-Origin-Const.py

b_1:
Intercept    963.191336
roe           18.501186
dtype: float64

b_2:
roe    63.537955
dtype: float64

b_3:
Intercept    1281.119617
dtype: float64

sal_mean: 1281.1196172248804

Figure 2.3. Regression through the Origin and on a Constant


2.6. Expected Values, Variances, and Standard Errors


Wooldridge (2019) discusses the role of five assumptions under which the OLS parameter estimators 
have desirable properties. In short form they are 

* SLR.1: Linear population regression function: $y = \beta_0 + \beta_1 x + u$

* SLR.2: Random sampling of x and y from the population

* SLR.3: Variation in the sample values $x_1, \ldots, x_n$

* SLR.4: Zero conditional mean: $E(u|x) = 0$

* SLR.5: Homoscedasticity: $\mathrm{Var}(u|x) = \sigma^2$

Based on those, Wooldridge (2019) shows in Section 2.5:

* Theorem 2.1: Under SLR.1 to SLR.4, OLS parameter estimators are unbiased.
* Theorem 2.2: Under SLR.1 to SLR.5, OLS parameter estimators have a specific sampling variance.

Because the formulas for the sampling variance involve the variance of the error term, we also have
to estimate it using the unbiased estimator

$$\hat{\sigma}^2 = \frac{1}{n-2} \cdot \sum_{i=1}^{n} \hat{u}_i^2 = \frac{n-1}{n-2} \cdot \mathrm{Var}(\hat{u}_i), \tag{2.14}$$

where $\mathrm{Var}(\hat{u}_i) = \frac{1}{n-1} \cdot \sum_{i=1}^{n} \hat{u}_i^2$ is the usual sample variance. We have to use the degrees-of-freedom
adjustment to account for the fact that we estimated the two parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ for constructing
the residuals. Its square root $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$ is called standard error of the regression (SER) by Wooldridge
(2019).
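
The standard errors (SE) of the estimators are

$$\mathrm{se}(\hat{\beta}_0) = \sqrt{\frac{\hat{\sigma}^2 \, \overline{x^2}}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} = \frac{1}{\sqrt{n-1}} \cdot \frac{\hat{\sigma}}{\mathrm{sd}(x)} \cdot \sqrt{\overline{x^2}} \tag{2.15}$$

$$\mathrm{se}(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} = \frac{1}{\sqrt{n-1}} \cdot \frac{\hat{\sigma}}{\mathrm{sd}(x)} \tag{2.16}$$

where $\mathrm{sd}(x)$ is the sample standard deviation $\sqrt{\frac{1}{n-1} \cdot \sum_{i=1}^{n} (x_i - \bar{x})^2}$.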

In Python, we can obviously do the calculations of Equations 2.14 through 2.16 explicitly. But the
output of the summary method for linear regression results, which we discovered in Section 2.3,
already contains the results. We use the following example to calculate the results in both ways to
open the black box of the canned routine and convince ourselves that from now on we can rely on it.
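
Before turning to real data, the standard error formulas can be cross-checked on simulated data. The sketch below is an illustration under assumed parameter values (intercept 1, slope 0.5, error standard deviation 2); it computes the SER and the two standard errors from the scalar formulas and compares them with the square roots of the diagonal of the estimated variance-covariance matrix, sigma squared times the inverse of X'X.

```python
import numpy as np

# simulated data with assumed population parameters (illustrative only)
rng = np.random.default_rng(0)
n = 500
x = rng.normal(4, 1, size=n)
y = 1 + 0.5 * x + rng.normal(0, 2, size=n)

# OLS estimates for the simple regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
u_hat = y - b0 - b1 * x

# SER: sample variance of the residuals with degrees-of-freedom adjustment
ser = np.sqrt(np.var(u_hat, ddof=1)) * np.sqrt((n - 1) / (n - 2))

# standard errors of intercept and slope from the scalar formulas
sd_x = np.std(x, ddof=1)
se_b0 = ser / (sd_x * np.sqrt(n - 1)) * np.sqrt(np.mean(x ** 2))
se_b1 = ser / (sd_x * np.sqrt(n - 1))

# cross-check: square roots of the diagonal of ser^2 * (X'X)^(-1)
X = np.column_stack((np.ones(n), x))
se_matrix = np.sqrt(np.diag(ser ** 2 * np.linalg.inv(X.T @ X)))

print(f'se_b0: {se_b0}, se_b1: {se_b1}')
print(f'se_matrix: {se_matrix}')
```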


Wooldridge, Example 2.12: Student Math Performance and the School Lunch 
Program 


Using the data set MEAP93, we regress a math performance score of schools on the share of students 
eligible for a federally funded lunch program. Wooldridge (2019) uses this example to demonstrate the 
importance of assumption SLR.4 and warns us against interpreting the regression results in a causal way. 
Here, we merely use the example to demonstrate the calculation of standard errors. 

Script 2.13 (Example-2-12.py) first calculates the SER manually using the fact that the residuals $\hat{u}$ are
available as results.resid. Then, the SE of the parameters are calculated according to Equations
2.15 and 2.16, where the regressor is addressed as the variable in the data frame meap93['lnchprg'].
Finally, we see the output of the summary method. The SE of the parameters are reported in the second
column of the regression table, next to the parameter estimates. We will look at the other columns in
Chapter 4. All values are exactly the same as the manual results.


Script 2.13: Example-2-12.py
import numpy as np
import wooldridge as woo
import statsmodels.formula.api as smf

meap93 = woo.dataWoo('meap93')

# estimate the model and save the results as "results":
reg = smf.ols(formula='math10 ~ lnchprg', data=meap93)
results = reg.fit()

# number of obs.:
n = results.nobs

# SER:
u_hat_var = np.var(results.resid, ddof=1)
SER = np.sqrt(u_hat_var) * np.sqrt((n - 1) / (n - 2))
print(f'SER: {SER}\n')

# SE of b0 & b1, respectively:
lnchprg_sq_mean = np.mean(meap93['lnchprg'] ** 2)
lnchprg_var = np.var(meap93['lnchprg'], ddof=1)
b0_se = SER / (np.sqrt(lnchprg_var)
               * np.sqrt(n - 1)) * np.sqrt(lnchprg_sq_mean)
b1_se = SER / (np.sqrt(lnchprg_var) * np.sqrt(n - 1))
print(f'b0_se: {b0_se}\n')
print(f'b1_se: {b1_se}\n')

# automatic calculations:
print(f'results.summary(): \n{results.summary()}\n')


Output of Script 2.13: Example-2-12.py
SER: 9.565938459482759

b0_se: 0.9975823856755018

b1_se: 0.034839334258369624

results.summary():
                            OLS Regression Results
Dep. Variable:                 math10   R-squared:                       0.171
Model:                            OLS   Adj. R-squared:                  0.169
Method:                 Least Squares   F-statistic:                     83.77
Date:                Thu, 23 Apr 2020   Prob (F-statistic):           2.75e-18
Time:                        08:20:16   Log-Likelihood:                -1499.3
No. Observations:                 408   AIC:                             3003.
Df Residuals:                     406   BIC:                             3011.
Df Model:                           1
Covariance Type:            nonrobust
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept     32.1427      0.998     32.221      0.000      30.182      34.104
lnchprg       -0.3189      0.035     -9.152      0.000      -0.387      -0.250
Omnibus:                       61.162   Durbin-Watson:                   1.908
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              105.062
Skew:                           0.886   Prob(JB):                     1.53e-23
Kurtosis:                       4.743   Cond. No.                         60.4

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly


2.7. Monte Carlo Simulations


In this section, we use Monte Carlo simulation experiments to revisit many of the topics covered 
in this chapter. It can be skipped but can help quite a bit to grasp the concepts of estimators, 
estimates, unbiasedness, the sampling variance of the estimators, and the consequences of violated 
assumptions. Remember that the concept of Monte Carlo simulations was introduced in Section 1.9. 


2.7.1. One Sample 


In Section 1.9, we used simulation experiments to analyze the features of a simple mean estimator. 
We also discussed the sampling from a given distribution, the random seed and simple examples. 
We can use exactly the same strategy to analyze OLS parameter estimators. 

Script 2.14 (SLR-Sim-Sample.py) shows how to draw a sample which is consistent with Assumptions
SLR.1 through SLR.5. We simulate a sample of size n = 1000 with population parameters
$\beta_0 = 1$ and $\beta_1 = 0.5$. We set the standard deviation of the error term u to $\sigma = 2$. Obviously, these
parameters can be freely chosen and every reader is strongly encouraged to play around.


Script 2.14: SLR-Sim-Sample.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import scipy.stats as stats
import matplotlib.pyplot as plt

# set the random seed:
np.random.seed(1234567)

# set sample size:
n = 1000

# set true parameters (betas and sd of u):
beta0 = 1
beta1 = 0.5
su = 2

# draw a sample of size n:
x = stats.norm.rvs(4, 1, size=n)
u = stats.norm.rvs(0, su, size=n)
y = beta0 + beta1 * x + u
df = pd.DataFrame({'y': y, 'x': x})

# estimate parameters by OLS:
reg = smf.ols(formula='y ~ x', data=df)
results = reg.fit()
b = results.params
print(f'b: \n{b}\n')

# features of the sample for the variance formula:
x_sq_mean = np.mean(x ** 2)
print(f'x_sq_mean: {x_sq_mean}\n')
x_var = np.sum((x - np.mean(x)) ** 2)
print(f'x_var: {x_var}\n')

# graph:
x_range = np.linspace(0, 8, num=100)
plt.ylim([-2, 10])
plt.plot(x, y, color='lightgrey', marker='o', linestyle='')
plt.plot(x_range, beta0 + beta1 * x_range, color='black',
         linestyle='-', linewidth=2, label='pop. regr. fct.')
plt.plot(x_range, b['Intercept'] + b['x'] * x_range, color='grey',
         linestyle='-', linewidth=2, label='OLS regr. fct.')
plt.ylabel('y')
plt.xlabel('x')
plt.legend()
plt.savefig('PyGraphs/SLR-Sim-Sample.pdf')


Output of Script 2.14: SLR-Sim-Sample.py

b:
Intercept    1.190238
x            0.444255
dtype: float64

x_sq_mean: 17.27675304867723

x_var: 953.7353266586754


Then a random sample of x and y is drawn in three steps: 

* A sample of regressors x is drawn from an arbitrary distribution. The only thing we have to
make sure to stay consistent with Assumption SLR.3 is that its variance is strictly positive. We
choose a normal distribution with mean 4 and a standard deviation of 1.

* A sample of error terms u is drawn according to Assumptions SLR.4 and SLR.5: It has a mean
of zero, and both the mean and the variance are unrelated to x. We simply choose a normal
distribution with mean 0 and standard deviation $\sigma = 2$ for all 1000 observations independent
of x. In Sections 2.7.3 and 2.7.4 we will adjust this to simulate the effects of a violation of these
assumptions.

* Finally, we generate the dependent variable y according to the population regression function 
specified in Assumption SLR.1. 

In an empirical project, we only observe x and y and not the realizations of the error term u. In 
the simulation, we "forget" them and the fact that we know the population parameters and estimate 
them from our sample using OLS. As motivated in Section 1.9, this will help us to study the behavior 
of the estimator in a sample like ours. 

For our particular sample, the OLS parameter estimates are $\hat{\beta}_0 = 1.190238$ and $\hat{\beta}_1 = 0.444255$.
The graph generated in the last lines of Script 2.14 (SLR-Sim-Sample.py) is shown in
Figure 2.4. It shows the population regression function with intercept $\beta_0 = 1$ and slope $\beta_1 = 0.5$. It
also shows the scatter plot of the sample drawn from this population. This sample led to our OLS
regression line with intercept $\hat{\beta}_0 = 1.190238$ and slope $\hat{\beta}_1 = 0.444255$ shown in gray.

Since the SLR assumptions hold in our exercise, Theorems 2.1 and 2.2 of Wooldridge (2019) should
apply. Theorem 2.1 implies for our model that the estimators are unbiased, i.e.

$$E(\hat{\beta}_0) = \beta_0 = 1 \qquad E(\hat{\beta}_1) = \beta_1 = 0.5$$

The estimates obtained from our sample are relatively close to their population values. Obviously,
we can never expect to hit the population parameter exactly. If we change the random seed by
specifying a different number in Script 2.14 (SLR-Sim-Sample.py), we get a different sample and
different parameter estimates.

Figure 2.4. Simulated Sample and OLS Regression Line
(Scatter plot of the simulated sample with the population regression function, labeled 'pop. regr. fct.', and the OLS regression function, labeled 'OLS regr. fct.'.)


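Theorem 2.2 of Wooldridge (2019) states the sampling variance of the estimators conditional on
the sample values $(x_1, \ldots, x_n)$. It involves the average squared value $\overline{x^2} = 17.277$ and the sum of
squares $\sum_{i=1}^{n} (x_i - \bar{x})^2 = 953.735$ which we also know from the Python output:

$$\mathrm{Var}(\hat{\beta}_0) = \frac{\sigma^2 \, \overline{x^2}}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{4 \cdot 17.277}{953.735} = 0.0725$$

$$\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{4}{953.735} = 0.0042$$

If Wooldridge (2019) is right, the standard error of $\hat{\beta}_1$ is $\sqrt{0.0042} = 0.0648$. So getting an estimate of
$\hat{\beta}_1 = 0.444$ for one sample doesn't seem unreasonable given $\beta_1 = 0.5$.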
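
The arithmetic behind these numbers is easy to retrace. The short sketch below plugs the values printed by Script 2.14 into the two variance formulas and also expresses the distance between the slope estimate and its population value in units of its standard error. The variable names are illustrative.

```python
import math

# values taken from the text and the output of Script 2.14
sigma_sq = 4                      # error variance (standard deviation 2)
x_sq_mean = 17.27675304867723     # average of x squared
x_sst = 953.7353266586754         # sum of squared deviations of x

# sampling variances conditional on x
var_b0 = sigma_sq * x_sq_mean / x_sst
var_b1 = sigma_sq / x_sst
se_b1 = math.sqrt(var_b1)

# distance of the slope estimate from the true slope, in standard errors
z = (0.444255 - 0.5) / se_b1

print(f'var_b0: {var_b0:.4f}')
print(f'var_b1: {var_b1:.4f}')
print(f'se_b1: {se_b1:.4f}')
print(f'z: {z:.2f}')
```

The estimate lies less than one standard error below the population slope, which is an entirely ordinary draw.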


2.7.2. Many Samples 


Since the expected values and variances of our estimators are defined over separate random samples 
from the same population, it makes sense for us to repeat our simulation exercise over many simu- 
lated samples. Just as motivated in Section 1.9, the distribution of OLS parameter estimates across 
these samples will correspond to the sampling distribution of the estimators. 

Script 2.16 (SLR-Sim-Model-Condx.py) implements this with the same for loop we introduced
in Section 1.8.2 and already used for basic Monte Carlo simulations in Section 1.9.1. Remember that
Python enthusiasts might choose a different technique but for us, this implementation has the big
advantage that it is very transparent. We analyze r = 10000 samples.

Note that we use the same values for x in all samples since we draw them outside of the loop. We
do this to simulate the exact setup of Theorem 2.2 which reports the sampling variances conditional
on x. In a more realistic setup, we would sample x along with y. The conceptual difference is subtle
and the results hardly differ in reasonably large samples. We will come back to these issues in
Chapter 5.² For each sample, we estimate our parameters and store them in the respective position
i = 0, ..., r - 1 of the arrays b0 and b1.


Script 2.16: SLR-Sim-Model-Condx.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import scipy.stats as stats
import matplotlib.pyplot as plt

# set the random seed:
np.random.seed(1234567)

# set sample size and number of simulations:
n = 1000
r = 10000

# set true parameters (betas and sd of u):
beta0 = 1
beta1 = 0.5
su = 2

# initialize b0 and b1 to store results later:
b0 = np.empty(r)
b1 = np.empty(r)

# draw a sample of x, fixed over replications:
x = stats.norm.rvs(4, 1, size=n)

# repeat r times:
for i in range(r):
    # draw a sample of y:
    u = stats.norm.rvs(0, su, size=n)
    y = beta0 + beta1 * x + u
    df = pd.DataFrame({'y': y, 'x': x})

    # estimate and store parameters by OLS:
    reg = smf.ols(formula='y ~ x', data=df)
    results = reg.fit()
    b0[i] = results.params['Intercept']
    b1[i] = results.params['x']

# MC estimate of the expected values:
b0_mean = np.mean(b0)
b1_mean = np.mean(b1)

print(f'b0_mean: {b0_mean}\n')
print(f'b1_mean: {b1_mean}\n')

# MC estimate of the variances:
b0_var = np.var(b0, ddof=1)
b1_var = np.var(b1, ddof=1)

print(f'b0_var: {b0_var}\n')
print(f'b1_var: {b1_var}\n')

# graph:
x_range = np.linspace(0, 8, num=100)
plt.ylim([0, 6])

# add population regression line:
plt.plot(x_range, beta0 + beta1 * x_range, color='black',
         linestyle='-', linewidth=2, label='Population')

# add first OLS regression line (to attach a label):
plt.plot(x_range, b0[0] + b1[0] * x_range, color='grey',
         linestyle='-', linewidth=0.5, label='OLS regressions')

# add OLS regression lines no. 2 to 10:
for i in range(1, 10):
    plt.plot(x_range, b0[i] + b1[i] * x_range, color='grey',
             linestyle='-', linewidth=0.5)
plt.ylabel('y')
plt.xlabel('x')
plt.legend()
plt.savefig('PyGraphs/SLR-Sim-Model-Condx.pdf')


² In Script 2.15 (SLR-Sim-Model.py) shown on page 340, we implement the joint sampling from x and y. The results are
essentially the same.


Output of Script 2.16: SLR-Sim-Model-Condx.py
b0_mean: 1.00329460319241

b1_mean: 0.49936958775965984

b0_var: 0.07158103946245628

b1_var: 0.004157652196227234


Script 2.16 (SLR-Sim-Model-Condx.py) gives descriptive statistics of the r = 10,000 estimates
we got from our simulation exercise. Wooldridge (2019, Theorem 2.1) claims that the OLS estimators
are unbiased, so we should expect the averages of the estimates to be very close to the respective population
parameters. This is clearly confirmed. The average value of $\hat{\beta}_0$ is very close to $\beta_0 = 1$ and the average
value of $\hat{\beta}_1$ is very close to $\beta_1 = 0.5$.

The simulated sampling variances are $\mathrm{Var}(\hat{\beta}_0) = 0.0716$ and $\mathrm{Var}(\hat{\beta}_1) = 0.0042$. These values
are also very close to the ones we expected from Theorem 2.2. The last lines of the code produce Figure
2.5. It shows the OLS regression lines for the first 10 simulated samples together with the population
regression function.
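
As an example of the "different technique" mentioned earlier, the same experiment can be run without an explicit loop. This is a minimal sketch, not the book's implementation: it uses numpy's default_rng generator, a smaller sample size and number of replications to keep the run short, and the closed-form simple regression formulas instead of statsmodels. All of these are assumptions made for illustration, so the numbers will not match the output above.

```python
import numpy as np

rng = np.random.default_rng(1234567)
n, r = 200, 2000            # smaller than in the text to keep the run short
beta0, beta1, su = 1, 0.5, 2

# regressor values, fixed over replications
x = rng.normal(4, 1, size=n)
x_dev = x - x.mean()

# one row per simulated sample
u = rng.normal(0, su, size=(r, n))
y = beta0 + beta1 * x + u

# closed-form OLS estimates for all r samples at once
b1 = (y - y.mean(axis=1, keepdims=True)) @ x_dev / np.sum(x_dev ** 2)
b0 = y.mean(axis=1) - b1 * x.mean()

# theoretical slope variance conditional on x
var_b1_theory = su ** 2 / np.sum(x_dev ** 2)

print(f'b0 mean: {b0.mean()}, b1 mean: {b1.mean()}')
print(f'b1 var (simulated): {np.var(b1, ddof=1)}')
print(f'b1 var (theory): {var_b1_theory}')
```

The loop version remains easier to read and to extend, which is why the text prefers it; the vectorized version mainly saves run time.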


2.7.3. Violation of SLR.4 


We will come back to a more systematic discussion of the consequences of violating the SLR assumptions
below. At this point, we can already simulate the effects. In order to implement a violation of
SLR.4 (zero conditional mean), consider a case where in the population u is not mean independent
of x. A simple example is

$$E(u|x) = \frac{x - 4}{5}$$

What happens to our OLS estimator? Script 2.17 (SLR-Sim-Model-ViolSLR4.py) implements a
simulation of this model and is listed in the appendix (p. 342).

Figure 2.5. Population and Simulated OLS Regression Lines
(Population regression line, labeled 'Population', together with the OLS regression lines of the first 10 simulated samples, labeled 'OLS regressions'.)


The only line of code we changed compared to Script 2.16 (SLR-Sim-Model-Condx.py) is the 
sampling of u which now reads 


u_mean = np.array((x - 4) / 5)
u = stats.norm.rvs(u_mean, su, size=n)


The simulation results are presented in the output of Script 2.17 (SLR-Sim-Model-ViolSLR4.py).
Obviously, the OLS coefficients are now biased: The average estimates are far from the population
parameters $\beta_0 = 1$ and $\beta_1 = 0.5$. This confirms that Assumption SLR.4 is required to hold for the
unbiasedness shown in Theorem 2.1.


Output of Script 2.17: SLR-Sim-Model-ViolSLR4.py
b0_mean: 0.2032946031924096

b1_mean: 0.6993695877596598

b0_var: 0.07158103946245628

b1_var: 0.004157652196227234
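
The size of the bias can be worked out by hand. Substituting the conditional mean of u from this example into the population regression function with $\beta_0 = 1$ and $\beta_1 = 0.5$ gives the line that the estimates center on in this setup:

```latex
\begin{aligned}
E(y \mid x) &= \beta_0 + \beta_1 x + E(u \mid x) \\
            &= 1 + 0.5\,x + \frac{x - 4}{5} \\
            &= (1 - 0.8) + (0.5 + 0.2)\,x \\
            &= 0.2 + 0.7\,x
\end{aligned}
```

This agrees with the simulated averages of about 0.203 for the intercept and 0.699 for the slope. The simulated variances are unchanged because only the mean of u was shifted.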


2.7.4. Violation of SLR.5 


Theorem 2.1 (unbiasedness) does not require Assumption SLR.5 (homoscedasticity), but Theorem
2.2 (sampling variance) does. As an example for a violation consider the population specification

$$\mathrm{Var}(u|x) = \frac{4}{e^{4.5}} \cdot e^{x},$$

so SLR.5 is clearly violated since the variance depends on x. We assume exogeneity, so assumption
SLR.4 holds. The factor in front ensures that the unconditional variance Var(u) = 4.³ Based on this
unconditional variance only, the sampling variance should not change compared to the results above
and we would still expect $\mathrm{Var}(\hat{\beta}_0) = 0.0716$ and $\mathrm{Var}(\hat{\beta}_1) = 0.0042$. But since Assumption SLR.5 is
violated, Theorem 2.2 is not applicable.

³ Since x ~ Normal(4, 1), $e^{x}$ is log-normally distributed and has a mean of $e^{4.5}$.

Script 2.18 (SLR-Sim-Model-ViolSLR5.py) implements a simulation of this model and is listed 
in the appendix (p. 342). Here, we only had to change the line of code for the sampling of u to 


u_var = np.array(4 / np.exp(4.5) * np.exp(x))
u = stats.norm.rvs(0, np.sqrt(u_var), size=n)


The output of Script 2.18 (SLR-Sim-Model-ViolSLR5.py) demonstrates two effects: The unbi- 
asedness provided by Theorem 2.1 is unaffected, but the formula for sampling variance provided by 
Theorem 2.2 is incorrect. 


Output of Script 2.18: SLR-Sim-Model-ViolSLR5.py
b0_mean: 1.001414297039418

b1_mean: 0.4997594115253497

b0_var: 0.13175544492656727

b1_var: 0.010016166348092534
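
Why the simulated slope variance of about 0.0100 exceeds the 0.0042 from Theorem 2.2 can be seen from the variance formula that allows a separate error variance for each observation: the sum of squared x deviations times the individual variances, divided by the squared sum of squared x deviations. The sketch below evaluates it for one simulated draw of x. It uses numpy's default_rng generator, so the x values are not those of the book's script and the numbers are only indicative.

```python
import numpy as np

rng = np.random.default_rng(1234567)
n = 1000
x = rng.normal(4, 1, size=n)

# conditional error variances from the heteroscedastic specification
u_var = 4 / np.exp(4.5) * np.exp(x)

x_dev = x - x.mean()
sst_x = np.sum(x_dev ** 2)

# slope variance from the homoscedastic formula with Var(u) = 4
var_b1_homosc = 4 / sst_x

# slope variance conditional on x when each observation has its own variance
var_b1_hetero = np.sum(x_dev ** 2 * u_var) / sst_x ** 2

print(f'var_b1_homosc: {var_b1_homosc}')
print(f'var_b1_hetero: {var_b1_hetero}')
```

Observations far to the right of the mean of x carry both a large weight in the slope estimate and a large error variance, which is what inflates the sampling variance.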


3. Multiple Regression Analysis: Estimation 


Running a multiple regression in Python is as straightforward as running a simple regression using 
the ols command in statsmodels. Section 3.1 shows how it is done. Section 3.2 opens the black
box and replicates the main calculations using matrix algebra. This is not required for the remaining 
chapters, so it can be skipped by readers who prefer to keep black boxes closed. 

Section 3.3 should not be skipped since it discusses the interpretation of regression results and the 
prevalent omitted variables problems. Finally, Section 3.4 covers standard errors and multicollinear- 
ity for multiple regression. 


3.1. Multiple Regression in Practice 


Consider the population regression model 
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k + u \tag{3.1}$$


and suppose the data set sample contains variables y, x1, x2, x3, ... with the respective data of our
sample. We estimate the model parameters by OLS using the commands


reg = smf.ols(formula='y ~ x1 + x2 + x3', data=sample)
results = reg.fit()


The tilde “~” again separates the dependent variable from the regressors which are now separated 
using a “+” sign. We can add options as before. The constant is again automatically added unless it 
is explicitly suppressed using 'y ~ 0 + x1 + x2 + x3 + ...'. 

We are already familiar with the workings of smf.ols and fit: The first command creates an
object which contains all relevant information and the estimation is performed in a second step. The
estimation results are stored in a variable results using the code results = reg.fit(). We
can use this variable for further analyses. For a typical regression output including a coefficient table,
call results.summary(). Of course, if this is all we want, we can skip the intermediate steps and simply call
smf.ols(...).fit().summary() in one step. Further analyses involving residuals, fitted values
and the like can be used exactly as presented in Chapter 2.

The output of summary includes parameter estimates, standard errors according to Theorem 3.2
of Wooldridge (2019), the coefficient of determination $R^2$, and many more useful results we cannot
interpret before we have worked through Chapter 4.


Wooldridge, Example 3.1: Determinants of College GPA 


This example from Wooldridge (2019) relates the college GPA (colGPA) to the high school GPA (hsGPA)
and achievement test score (ACT) for a sample of 141 students. The commands and results can be
found in Script 3.1 (Example-3-1.py). The OLS regression function is

$$\widehat{colGPA} = 1.286 + 0.453 \cdot hsGPA + 0.0094 \cdot ACT.$$

Script 3.1: Example-3-1.py
import wooldridge as woo
import statsmodels.formula.api as smf

gpa1 = woo.dataWoo('gpa1')

reg = smf.ols(formula='colGPA ~ hsGPA + ACT', data=gpa1)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')


Output of Script 3.1: Example-3-1.py
results.summary():
                            OLS Regression Results
Dep. Variable:                 colGPA   R-squared:                       0.176
Model:                            OLS   Adj. R-squared:                  0.164
Method:                 Least Squares   F-statistic:                     14.78
Date:                Tue, 12 May 2020   Prob (F-statistic):           1.53e-06
Time:                        10:34:11   Log-Likelihood:                -46.573
No. Observations:                 141   AIC:                             99.15
Df Residuals:                     138   BIC:                             108.0
Df Model:                           2
Covariance Type:            nonrobust
                 coef    std err          t      P>|t|      [0.025      0.975]
hsGPA          0.4535      0.096      4.733      0.000       0.264
Omnibus:                        3.056   Durbin-Watson:
Prob(Omnibus):                  0.217   Jarque-Bera (JB):
Skew:                           0.199   Prob(JB):
Kurtosis:                               Cond. No.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly


Wooldridge, Example 3.4: Determinants of College GPA 


For the regression run in Example 3.1, the output of Script 3.1 (Example-3-1.py) reports $R^2 = 0.176$, so
about 17.6% of the variance in college GPA is explained by the two regressors.
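
The "share of explained variance" reading follows directly from how the coefficient of determination is computed. The sketch below illustrates this on simulated data with two regressors (the data are made up and only loosely resemble the GPA example). It computes one minus the ratio of the residual sum of squares to the total sum of squares and checks it against the squared correlation between y and the fitted values, which coincides with it when the model includes a constant.

```python
import numpy as np

# simulated data with two regressors (illustrative, not the gpa1 data)
rng = np.random.default_rng(3)
n = 141
x1 = rng.normal(3.4, 0.3, size=n)
x2 = rng.normal(24, 3, size=n)
y = 1.3 + 0.45 * x1 + 0.01 * x2 + rng.normal(0, 0.34, size=n)

# OLS fit with a constant
X = np.column_stack((np.ones(n), x1, x2))
b = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ b
u_hat = y - y_hat

# R-squared as one minus residual variation over total variation
sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum(u_hat ** 2)
r2 = 1 - ssr / sst

# equivalent: squared correlation between y and the fitted values
r2_alt = np.corrcoef(y, y_hat)[0, 1] ** 2

print(f'r2: {r2}')
print(f'r2_alt: {r2_alt}')
```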


Examples 3.2, 3.3, 3.5, 3.6: Further Multiple Regression Examples 

In order to get a feeling for the methods and results, we present the analyses including
the full regression tables of the mentioned Examples from Wooldridge (2019) in Scripts 3.2
(Example-3-2.py) through 3.6 (Example-3-6.py). See Wooldridge (2019) for descriptions of
the data sets and variables and for comments on the results.

Script 3.2: Example-3-2.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

wage1 = woo.dataWoo('wage1')

reg = smf.ols(formula='np.log(wage) ~ educ + exper + tenure', data=wage1)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')


Output of Script 3.2: Example-3-2.py
results.summary():
                            OLS Regression Results
Dep. Variable:           np.log(wage)   R-squared:                       0.316
Model:                            OLS   Adj. R-squared:                  0.312
Method:                 Least Squares   F-statistic:                     80.39
Date:                Tue, 12 May 2020   Prob (F-statistic):           9.13e-43
Time:                        10:34:13   Log-Likelihood:                -313.55
No. Observations:                 526   AIC:                             635.1
Df Residuals:                     522   BIC:                             652.2
Df Model:                           3
Covariance Type:            nonrobust
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept      0.2844      0.104      2.729      0.007       0.080       0.489
educ           0.0920      0.007     12.555      0.000       0.078       0.106
exper          0.0041      0.002      2.391      0.017       0.001       0.008
tenure         0.0221      0.003      7.133      0.000       0.016       0.028
Omnibus:                                Durbin-Watson:                   1.769
Prob(Omnibus):                          Jarque-Bera (JB):               20.941
Skew:                                   Prob(JB):                     2.84e-05
Kurtosis:                               Cond. No.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly

Script 3.3: Example-3-3.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

k401k = woo.dataWoo('401k')

reg = smf.ols(formula='prate ~ mrate + age', data=k401k)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')


Output of Script 3.3: Example-3-3.py
results.summary():
                            OLS Regression Results
Dep. Variable:                  prate   R-squared:                       0.092
Model:                            OLS   Adj. R-squared:                  0.091
Method:                 Least Squares   F-statistic:                     77.79
Date:                Tue, 12 May 2020   Prob (F-statistic):           6.67e-33
Time:                        10:34:15   Log-Likelihood:                -6422.3
No. Observations:                1534   AIC:                         1.285e+04
Df Residuals:                    1531   BIC:                         1.287e+04
Df Model:                           2
Covariance Type:            nonrobust
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept     80.1190      0.779    102.846      0.000      78.591      81.647
mrate          5.5213      0.526     10.499      0.000       4.490       6.553
age            0.2431      0.045      5.440      0.000       0.155       0.331
Omnibus:                      375.579   Durbin-Watson:                   1.910
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              805.992
Skew:                          -1.387   Prob(JB):                    9.57e-176
Kurtosis:                       5.217   Cond. No.                         32.9

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly

Script 3.4: Example-3-5a.py
import wooldridge as woo
import statsmodels.formula.api as smf

crime1 = woo.dataWoo('crime1')

# model without avgsen:
reg = smf.ols(formula='narr86 ~ pcnv + ptime86 + qemp86', data=crime1)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')


Output of Script 3.4: Example-3-5a.py
results.summary():
                            OLS Regression Results
Dep. Variable:                 narr86   R-squared:                       0.041
Model:                            OLS   Adj. R-squared:                  0.040
Method:                 Least Squares   F-statistic:                     39.10
Date:                Tue, 12 May 2020   Prob (F-statistic):           9.91e-25
Time:                        10:34:16   Log-Likelihood:                -3394.7
No. Observations:                2725   AIC:                             6797.
Df Residuals:                    2721   BIC:                             6821.
Df Model:                           3
Covariance Type:            nonrobust
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept      0.7118      0.033     21.565      0.000       0.647       0.776
pcnv          -0.1499      0.041     -3.669      0.000      -0.230      -0.070
ptime86       -0.0344      0.009     -4.007      0.000      -0.051      -0.018
qemp86        -0.1041      0.010    -10.023      0.000      -0.124      -0.084
Omnibus:                                Durbin-Watson:                   1.836
Prob(Omnibus):                          Jarque-Bera (JB):           106169.153
Skew:                                   Prob(JB):                         0.00
Kurtosis:                               Cond. No.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly

Script 3.5: Example-3-5b.py
import wooldridge as woo
import statsmodels.formula.api as smf

crime1 = woo.dataWoo('crime1')

# model with avgsen:
reg = smf.ols(formula='narr86 ~ pcnv + avgsen + ptime86 + qemp86', data=crime1)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')


Output of Script 3.5: Example-3-5b.py
results.summary():
                            OLS Regression Results
Dep. Variable:                 narr86   R-squared:                       0.042
Model:                            OLS   Adj. R-squared:                  0.041
Method:                 Least Squares   F-statistic:                     29.96
Date:                Tue, 12 May 2020   Prob (F-statistic):           2.01e-24
Time:                        10:34:17   Log-Likelihood:                -3393.5
No. Observations:                2725   AIC:                             6797.
Df Residuals:                    2720   BIC:                             6826.
Df Model:                           4
Covariance Type:            nonrobust
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept      0.7068      0.033     21.319      0.000       0.642       0.772
pcnv          -0.1508      0.041     -3.692      0.000      -0.231      -0.071
avgsen         0.0074      0.005      1.572      0.116      -0.002       0.017
ptime86       -0.0374      0.009     -4.252      0.000      -0.055      -0.020
qemp86        -0.1033      0.010     -9.940      0.000      -0.124      -0.083
Omnibus:                     2396.990   Durbin-Watson:                   1.837
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           106841.658
Skew:                           4.006   Prob(JB):                         0.00
Kurtosis:                      32.611   Cond. No.                         10.2

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly

Script 3.6: Example-3-6.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

wage1 = woo.dataWoo('wage1')

reg = smf.ols(formula='np.log(wage) ~ educ', data=wage1)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')


Output of Script 3.6: Example-3-6.py
results.summary():
                            OLS Regression Results
Dep. Variable:           np.log(wage)   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.184
Method:                 Least Squares   F-statistic:                     119.6
Date:                Tue, 12 May 2020   Prob (F-statistic):           3.27e-25
Time:                        10:34:19   Log-Likelihood:                -359.38
No. Observations:                 526   AIC:                             722.8
Df Residuals:                     524   BIC:                             731.3
Df Model:                           1
Covariance Type:            nonrobust
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept      0.5838      0.097      5.998      0.000       0.393       0.775
educ           0.0827      0.008     10.935      0.000       0.068       0.098
Omnibus:                                Durbin-Watson:                   1.801
Prob(Omnibus):                          Jarque-Bera (JB):               13.811
Skew:                                   Prob(JB):                      0.00100
Kurtosis:                               Cond. No.                         60.2

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly


3.2. OLS in Matrix Form 


For applying regression methods to empirical problems, we do not actually need to know the for- 
mulas our software uses. In multiple regression, we need to resort to matrix algebra in order to find 
an explicit expression for the OLS parameter estimates. Wooldridge (2019) defers this discussion to 
Appendix E and we follow the notation used there. Going through this material is not required for 
applying multiple regression to real-world problems but is useful for a deeper understanding of the 
methods and their black-box implementations in software packages. In the following chapters, we 
will rely on the comfort of the canned routine fit, so this section may be skipped.

In matrix form, we store the regressors in an $n \times (k+1)$ matrix $X$ which has a column for
each regressor plus a column of ones for the constant. The sample values of the dependent
variable are stored in an $n \times 1$ column vector $y$. Wooldridge (2019) derives the OLS estimator
$\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k)'$ to be

$$\hat{\beta} = (X'X)^{-1} X'y. \tag{3.2}$$

This equation involves three matrix operations which we know how to implement in Python from 
Section 1.2.3: 

* Transpose: The expression $X'$ is X.T in numpy

* Matrix multiplication: The expression $X'X$ is translated as X.T @ X

* Inverse: $(X'X)^{-1}$ is written as np.linalg.inv(X.T @ X)

So we can collect everything and translate Equation 3.2 into the somewhat unsightly expression


b = np.linalg.inv(X.T @ X) @ X.T @ y


The vector of residuals can be manually calculated as

$$\hat{u} = y - X\hat{\beta} \tag{3.3}$$

or translated into the numpy matrix language


u_hat = y - X @ b


The formula for the estimated variance of the error term is

$$\hat{\sigma}^2 = \frac{1}{n-k-1} \hat{u}'\hat{u}, \tag{3.4}$$

which is equivalent to


sigsq_hat = (u_hat.T @ u_hat) / (n - k - 1)


The standard error of the regression (SER) is its square root $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$. The estimated OLS
variance-covariance matrix according to Wooldridge (2019, Theorem E.2) is then

$$\widehat{Var}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1}. \tag{3.5}$$


Vb_hat = sigsq_hat * np.linalg.inv(X.T @ X)


Finally, the standard errors of the parameter estimates are the square roots of the main diagonal of
$\widehat{Var}(\hat{\beta})$ which can be expressed in numpy as


se = np.sqrt(np.diagonal(Vb_hat))


Script 3.7 (OLS-Matrices.py) implements this for the GPA regression from Example 3.1. Comparing
the results to the built-in function (see Script 3.1 (Example-3-1.py)), it is reassuring that we get
exactly the same numbers for the parameter estimates and standard errors of the coefficients. Script
3.7 (OLS-Matrices.py) also demonstrates another way of generating y and X by using the module
patsy. It includes the command dmatrices, which makes it convenient to create the matrices using
formula syntax.
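The matrix formulas in Equations 3.2 to 3.5 can also be tried out without any data set. The following is a minimal sketch on simulated data; the function name ols_matrix, the simulated variables and the use of np.linalg.solve instead of an explicit inverse for the coefficients are illustrative assumptions and not part of the original scripts.

```python
import numpy as np


def ols_matrix(X, y):
    """OLS via Equations 3.2-3.5; X must already contain a column of ones."""
    n, p = X.shape                       # p = k + 1 columns
    XtX = X.T @ X
    # solve (X'X) b = X'y instead of forming the inverse explicitly
    b = np.linalg.solve(XtX, X.T @ y)
    u_hat = y - X @ b                    # residuals, Equation 3.3
    sigsq_hat = (u_hat @ u_hat) / (n - p)            # Equation 3.4, n - p = n - k - 1
    Vb_hat = sigsq_hat * np.linalg.inv(XtX)          # Equation 3.5
    se = np.sqrt(np.diagonal(Vb_hat))
    return b, se, sigsq_hat


# hypothetical simulated data (not one of the Wooldridge data sets)
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
X = np.column_stack((np.ones(n), x1, x2))

b, se, sigsq_hat = ols_matrix(X, y)

# cross-check the coefficients against numpy's least squares routine
b_check = np.linalg.lstsq(X, y, rcond=None)[0]
print(f'b: {b}\nse: {se}\nmax. difference to lstsq: {np.max(np.abs(b - b_check))}')
```

Solving the linear system gives the same coefficients as the explicit inverse in exact arithmetic and tends to be numerically more stable when regressors are highly correlated.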
Script 3.7: OLS-Matrices.py
import wooldridge as woo
import numpy as np
import pandas as pd
import patsy as pt

gpa1 = woo.dataWoo('gpa1')

# determine sample size & no. of regressors:
n = len(gpa1)
k = 2

# extract y:
y = gpa1['colGPA']

# extract X & add a column of ones:
X = pd.DataFrame({'const': 1, 'hsGPA': gpa1['hsGPA'], 'ACT': gpa1['ACT']})

# alternative with patsy:
y2, X2 = pt.dmatrices('colGPA ~ hsGPA + ACT', data=gpa1, return_type='dataframe')

# display first rows of X:
print(f'X.head(): \n{X.head()}\n')

# parameter estimates:
X = np.array(X)
y = np.array(y).reshape(n, 1)  # creates a column vector
b = np.linalg.inv(X.T @ X) @ X.T @ y
print(f'b: \n{b}\n')

# residuals, estimated variance of u and SER:
u_hat = y - X @ b
sigsq_hat = (u_hat.T @ u_hat) / (n - k - 1)
SER = np.sqrt(sigsq_hat)
print(f'SER: {SER}\n')

# estimated variance of the parameter estimators and SE:
Vbeta_hat = sigsq_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diagonal(Vbeta_hat))
print(f'se: {se}\n')


Output of Script 3.7: OLS-Matrices.py
X.head():
   const  hsGPA  ACT
0      1    3.0   21
1      1    3.2   24
2      1    3.6   26
3      1    3.5   27
4      1    3.9   28
b: 
[[1.28632777] 
[0.45345589] 
[0.00942601]] 


SER: [[0.34031576]] 


se: [0.34082212 0.09581292 0.01077719] 
3.3. Ceteris Paribus Interpretation and Omitted Variable Bias


The parameters in a multiple regression can be interpreted as partial effects. In a general model with
$k$ regressors, the estimated slope parameter $\hat{\beta}_j$ associated with variable $x_j$ is the change of $\hat{y}$ as $x_j$
increases by one unit and the other variables are held fixed.

Wooldridge (2019) discusses this interpretation in Section 3.2 and offers a useful formula for
interpreting the difference between simple regression results and this ceteris paribus interpretation of
multiple regression: Consider a regression with two explanatory variables:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2. \tag{3.6}$$

The parameter $\hat{\beta}_1$ is the estimated effect of increasing $x_1$ by one unit while keeping $x_2$ fixed. In
contrast, consider the simple regression including only $x_1$ as a regressor:

$$\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1. \tag{3.7}$$

The parameter $\tilde{\beta}_1$ is the estimated effect of increasing $x_1$ by one unit (and NOT keeping $x_2$ fixed). It
can be related to $\hat{\beta}_1$ using the formula

$$\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1, \tag{3.8}$$

where $\tilde{\delta}_1$ is the slope parameter of the linear regression of $x_2$ on $x_1$:

$$x_2 = \tilde{\delta}_0 + \tilde{\delta}_1 x_1. \tag{3.9}$$

This equation is actually quite intuitive: As $x_1$ increases by one unit,
* Predicted $y$ directly increases by $\hat{\beta}_1$ units (ceteris paribus effect, Equ. 3.6).
* Predicted $x_2$ increases by $\tilde{\delta}_1$ units (see Equ. 3.9).
* Each of these $\tilde{\delta}_1$ units leads to an increase of predicted $y$ by $\hat{\beta}_2$ units, giving a total indirect
effect of $\tilde{\delta}_1 \hat{\beta}_2$ (see again Equ. 3.6).
* The overall effect $\tilde{\beta}_1$ is the sum of the direct and indirect effects (see Equ. 3.8).

We revisit Example 3.1 to see whether we can demonstrate Equation 3.8 in Python. Script 3.8
(Omitted-Vars.py) repeats the regression of the college GPA (colGPA) on the achievement test
score (ACT) and the high school GPA (hsGPA). We study the ceteris paribus effect of ACT on colGPA
which has an estimated value of $\hat{\beta}_1 = 0.0094$. The estimated effect of hsGPA is $\hat{\beta}_2 = 0.453$. The slope
parameter of the regression corresponding to Equation 3.9 is $\tilde{\delta}_1 = 0.0389$. Plugging these values into
Equation 3.8 gives a total effect of $\tilde{\beta}_1 = 0.0271$ which is exactly what the simple regression at the end
of the output delivers.

In this example, the indirect effect is actually stronger than the direct effect. ACT predicts colGPA
mainly because it is related to hsGPA which in turn is strongly related to colGPA.
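Equation 3.8 is an exact in-sample identity, so it can be checked on any data. The sketch below first redoes the arithmetic with the rounded estimates reported for the GPA example and then verifies the identity on simulated data; the helper ols and all simulated variables are hypothetical and only serve as an illustration.

```python
import numpy as np

# arithmetic with the rounded estimates reported for the GPA example
b1_book = 0.009426 + 0.453456 * 0.038897   # direct effect + indirect effect
print(f'b1_book: {b1_book:.6f}')


def ols(X, y):
    """Least squares coefficients; X must contain a column of ones."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


# hypothetical simulated data with correlated regressors
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 2.0 * x2 + rng.normal(size=n)
ones = np.ones(n)

b_hat = ols(np.column_stack((ones, x1, x2)), y)      # full model, Equation 3.6
b_tilde = ols(np.column_stack((ones, x1)), y)        # x2 omitted, Equation 3.7
delta_tilde = ols(np.column_stack((ones, x1)), x2)   # x2 on x1, Equation 3.9

# omitted variables formula, Equation 3.8
b1_formula = b_hat[1] + b_hat[2] * delta_tilde[1]
print(f'formula: {b1_formula}\nsimple regression: {b_tilde[1]}')
```

The two printed slopes agree up to floating point error, whatever the true parameters of the simulation are.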

These relations hold for the estimates from a given sample. In Section 3.3, Wooldridge (2019) 
discusses how to apply the same sort of arguments to the OLS estimators which are random variables 
varying over different samples. Omitting relevant regressors causes bias if we are interested in 
estimating partial effects. In practice, it is difficult to include all relevant regressors, which makes
omitted variables a prevalent problem. It is important enough to have motivated a vast amount
of methodological and applied research. More advanced techniques like instrumental variables or 
panel data methods try to solve the problem in cases where we cannot add all relevant regressors, 
for example because they are unobservable. We will come back to this in Part 3. 
Script 3.8: Omitted-Vars.py
import wooldridge as woo
import statsmodels.formula.api as smf

gpa1 = woo.dataWoo('gpa1')

# parameter estimates for full and simple model:
reg = smf.ols(formula='colGPA ~ ACT + hsGPA', data=gpa1)
results = reg.fit()
b = results.params
print(f'b: \n{b}\n')

# relation between regressors:
reg_delta = smf.ols(formula='hsGPA ~ ACT', data=gpa1)
results_delta = reg_delta.fit()
delta_tilde = results_delta.params
print(f'delta_tilde: \n{delta_tilde}\n')

# omitted variables formula for b1_tilde:
b1_tilde = b['ACT'] + b['hsGPA'] * delta_tilde['ACT']
print(f'b1_tilde: \n{b1_tilde}\n')

# actual regression with hsGPA omitted:
reg_om = smf.ols(formula='colGPA ~ ACT', data=gpa1)
results_om = reg_om.fit()
b_om = results_om.params
print(f'b_om: \n{b_om}\n')


Output of Script 3.8: Omitted-Vars.py 
b: 
Intercept 1.286328 
ACT 0.009426 
hsGPA 0.453456 
dtype: float64 


delta_tilde:
Intercept 2.462537
ACT 0.038897
dtype: float64


b1_tilde:
0.02706397394317861


b_om:
Intercept 2.402979
ACT 0.027064
dtype: float64
3.4. Standard Errors, Multicollinearity, and VIF


We have already seen the matrix formula for the conditional variance-covariance matrix under the
usual assumptions including homoscedasticity (MLR.5) in Equation 3.5. Theorem 3.2 provides
another useful formula for the variance of a single parameter $\hat{\beta}_j$, i.e. for a single element on the main
diagonal of the variance-covariance matrix:

$$Var(\hat{\beta}_j) = \frac{\sigma^2}{SST_j (1 - R_j^2)} = \frac{1}{n} \cdot \frac{\sigma^2}{Var(x_j)} \cdot \frac{1}{1 - R_j^2}, \tag{3.10}$$

where $SST_j = \sum_{i=1}^{n} (x_{ji} - \bar{x}_j)^2 = n \cdot Var(x_j)$ is the total sum of squares and $R_j^2$ is the usual
coefficient of determination from a regression of $x_j$ on all of the other regressors.¹
The variance of $\hat{\beta}_j$ consists of four parts:

* $\frac{1}{n}$: The variance is smaller for larger samples.

* $\sigma^2$: The variance is larger if the error term varies a lot, since it introduces randomness into the
relationship between the variables of interest.

* $\frac{1}{Var(x_j)}$: The variance is smaller if the regressor $x_j$ varies a lot since this provides relevant
information about the relationship.

* $\frac{1}{1 - R_j^2}$: This variance inflation factor (VIF) accounts for (imperfect) multicollinearity. If $x_j$ is
highly related to the other regressors, $R_j^2$ and therefore also $VIF_j$ and the variance of $\hat{\beta}_j$ are
large.

Since the error variance $\sigma^2$ is unknown, we replace it with an estimate to come up with an
estimated variance of the parameter estimate. Its square root is the standard error

$$se(\hat{\beta}_j) = \frac{1}{\sqrt{n}} \cdot \frac{\hat{\sigma}}{sd(x_j)} \cdot \frac{1}{\sqrt{1 - R_j^2}}. \tag{3.11}$$

It is not directly obvious that this formula leads to the same results as the matrix formula in 
Equation 3.5. We will validate this formula by replicating Example 3.1 which we also used for 
manually calculating the SE using the matrix formula above. The calculations are shown in Script 
3.9 (MLR-SE.py).
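That the two formulas agree can also be checked numerically on artificial data. The following sketch compares the standard error from the matrix formula (Equation 3.5) with the one from Equation 3.11 for one regressor; the simulated variables and all names are assumptions made for this illustration, and the standard deviation uses the population formula (ddof=0), in line with the footnote to Equation 3.10.

```python
import numpy as np

# hypothetical simulated data with two correlated regressors
rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(size=n)
X = np.column_stack((np.ones(n), x1, x2))
k = X.shape[1] - 1

# SE from the matrix formula, Equation 3.5
b = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ b
sigma_hat = np.sqrt(u_hat @ u_hat / (n - k - 1))
se_matrix = sigma_hat * np.sqrt(np.diagonal(np.linalg.inv(X.T @ X)))

# ingredients of Equation 3.11 for the coefficient of x1:
# R^2 from the auxiliary regression of x1 on a constant and x2
Z = np.column_stack((np.ones(n), x2))
g = np.linalg.lstsq(Z, x1, rcond=None)[0]
resid = x1 - Z @ g
R2_1 = 1 - (resid @ resid) / np.sum((x1 - x1.mean()) ** 2)
sd_x1 = np.std(x1)   # population formula (ddof=0)

se_formula = 1 / np.sqrt(n) * sigma_hat / sd_x1 * 1 / np.sqrt(1 - R2_1)
print(f'se_matrix[1]: {se_matrix[1]}\nse_formula: {se_formula}')
```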

We also use this example to demonstrate how to extract results which are included in the object
returned by the fit method. Given its results are stored in the variable results using the code results
= smf.ols(...).fit(), we can easily access the information using results.resultname where
the resultname can be any of the following:

* params for the regression coefficients 
* resid for the residuals 

* mse_resid for the (squared) SER 

* rsquared for $R^2$

* and more. 


¹Note that here, we use the population variance formula $Var(x_j) = \frac{1}{n} \sum_{i=1}^{n} (x_{ji} - \bar{x}_j)^2$.
Script 3.9: MLR-SE.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

gpa1 = woo.dataWoo('gpa1')

# full estimation results including automatic SE:
reg = smf.ols(formula='colGPA ~ hsGPA + ACT', data=gpa1)
results = reg.fit()

# extract SER (instead of calculation via residuals):
SER = np.sqrt(results.mse_resid)

# regressing hsGPA on ACT for calculation of R2 & VIF:
reg_hsGPA = smf.ols(formula='hsGPA ~ ACT', data=gpa1)
results_hsGPA = reg_hsGPA.fit()
R2_hsGPA = results_hsGPA.rsquared
VIF_hsGPA = 1 / (1 - R2_hsGPA)
print(f'VIF_hsGPA: {VIF_hsGPA}\n')

# manual calculation of SE of hsGPA coefficient
# (sdx converts the sample standard deviation to the population formula):
n = results.nobs
sdx = np.std(gpa1['hsGPA'], ddof=1) * np.sqrt((n - 1) / n)
SE_hsGPA = 1 / np.sqrt(n) * SER / sdx * np.sqrt(VIF_hsGPA)
print(f'SE_hsGPA: {SE_hsGPA}\n')


Output of Script 3.9: MLR-SE.py
VIF_hsGPA: 1.1358234481972789


SE_hsGPA: 0.09581291608057597 


This is used in Script 3.9 (MLR-SE.py) to extract the SER of the main regression and the $R^2$ from
the regression of hsGPA on ACT which is needed for calculating the VIF for the coefficient of hsGPA.²
The other ingredients of Equation 3.11 are straightforward. The standard error calculated this way
is exactly the same as the one of the built-in command and the matrix formula used in Script 3.7
(OLS-Matrices.py).

A convenient way to automatically calculate variance inflation factors (VIF) is provided
by the module statsmodels in stats.outliers_influence. The command
variance_inflation_factor(X, regressornumber) delivers the VIF for a matrix X
and the number of a given regressor (starting with the constant as the regressor with number 0).
The calculation for each of the regressors is performed in a loop as demonstrated in Script 3.10
(MLR-VIF.py).

We extend Example 3.6 and regress individual log wage on education (educ), potential overall
work experience (exper), and the number of years with current employer (tenure). We could 
imagine that these three variables are correlated with each other, but the results show no big VIF. 
The largest one is for the coefficient of exper. Its variance is higher by a factor of (only) 1.478 than 
in a world in which it were uncorrelated with the other regressors. So we don't have to worry about 
multicollinearity here. 
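To see what such a VIF measures, it can be computed by hand from the auxiliary regression of one regressor on all others, following Equation 3.10. The sketch below does this on simulated data; the function vif_manual and the simulated regressors are hypothetical. It skips the constant column because the usual (centered) $R^2$ is not defined for a regressor without variation.

```python
import numpy as np


def vif_manual(X, j):
    """VIF of column j of X via the auxiliary regression on all other columns.

    X is assumed to include a column of ones; j must not be that column.
    """
    xj = X[:, j]
    others = np.delete(X, j, axis=1)
    g = np.linalg.lstsq(others, xj, rcond=None)[0]
    resid = xj - others @ g
    R2_j = 1 - (resid @ resid) / np.sum((xj - xj.mean()) ** 2)
    return 1 / (1 - R2_j)


# hypothetical simulated regressors: x1 and x2 are highly correlated, x3 is unrelated
rng = np.random.default_rng(4)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack((np.ones(n), x1, x2, x3))

vifs = np.array([vif_manual(X, j) for j in range(1, X.shape[1])])
print(f'VIF for x1, x2, x3: {vifs}')
```

In this artificial setting, the VIFs of x1 and x2 are large while the one of x3 stays close to 1, which is the pattern described by Equation 3.10.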


²We could have calculated these values manually like in Scripts 2.8 (Example-2-8.py), 2.13 (Example-2-12.py) or 3.7
(OLS-Matrices.py).
Script 3.10: MLR-VIF.py
import wooldridge as woo
import numpy as np
import statsmodels.stats.outliers_influence as smo
import patsy as pt

wage1 = woo.dataWoo('wage1')

# extract matrices using patsy:
y, X = pt.dmatrices('np.log(wage) ~ educ + exper + tenure',
                    data=wage1, return_type='dataframe')

# get VIF for each column of X (column 0 is the constant):
K = X.shape[1]
VIF = np.empty(K)
for i in range(K):
    VIF[i] = smo.variance_inflation_factor(X.values, i)
print(f'VIF: \n{VIF}\n')


Output of Script 3.10: MLR-VIF.py
VIF: 
[29.37890286 1.11277075 1.47761777 1.34929556] 


4. Multiple Regression Analysis: Inference 


Section 4.1 of Wooldridge (2019) adds assumption MLR.6 (normal distribution of the error term) 
to the previous assumptions MLR.1 through MLR.5. Together, these assumptions constitute the 
classical linear model (CLM). 


The main additional result we get from this assumption is stated in Theorem 4.1: The OLS parameter
estimators are normally distributed (conditional on the regressors $x_1, \ldots, x_k$). The benefit of this
result is that it allows us to do statistical inference similar to the approaches discussed in Section 1.7
for the simple estimator of the mean of a normally distributed random variable.


4.1. The t Test


After the sign and magnitude of the estimated parameters, empirical research typically pays most
attention to the results of t tests discussed in this section.


4.1.1. General Setup
An important type of hypotheses we are often interested in is of the form

$$H_0: \beta_j = a_j, \tag{4.1}$$

where $a_j$ is some given number, very often $a_j = 0$. For the most common case of two-tailed tests, the
alternative hypothesis is

$$H_1: \beta_j \neq a_j, \tag{4.2}$$

and for one-tailed tests it is either one of

$$H_1: \beta_j < a_j \quad \text{or} \quad H_1: \beta_j > a_j. \tag{4.3}$$

These hypotheses can be conveniently tested using a t test which is based on the test statistic

$$t = \frac{\hat{\beta}_j - a_j}{se(\hat{\beta}_j)}. \tag{4.4}$$

If $H_0$ is in fact true and the CLM assumptions hold, then this statistic has a t distribution with
$n - k - 1$ degrees of freedom.
4.1.2. Standard Case


Very often, we want to test whether there is any relation at all between the dependent variable $y$ and
a regressor $x_j$ and do not want to impose a sign on the partial effect a priori. This is a mission for the
standard two-sided t test with the hypothetical value $a_j = 0$, so

$$H_0: \beta_j = 0, \qquad H_1: \beta_j \neq 0, \tag{4.5}$$

$$t_{\hat{\beta}_j} = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)}. \tag{4.6}$$

The subscript on the t statistic indicates that this is "the" t value for $\hat{\beta}_j$ for this frequent version
of the test. Under $H_0$, it has the t distribution with $n - k - 1$ degrees of freedom implying that
the probability that $|t_{\hat{\beta}_j}| > c$ is equal to $\alpha$ if $c$ is the $1 - \frac{\alpha}{2}$ quantile of this distribution. If $\alpha$ is our
significance level (e.g. $\alpha = 5\%$), then we

reject $H_0$ if $|t_{\hat{\beta}_j}| > c$

in our sample. For the typical significance level $\alpha = 5\%$, the critical value $c$ will be around 2 for
reasonably large degrees of freedom and approach the counterpart of 1.96 from the standard normal
distribution in very large samples.

The p value indicates the smallest value of the significance level $\alpha$ for which we would still reject $H_0$
using our sample. So it is the probability for a random variable $T$ with the respective t distribution
that $|T| > |t_{\hat{\beta}_j}|$ where $t_{\hat{\beta}_j}$ is the value of the t statistic in our particular sample. In our two-tailed test,
it can be calculated as

$$p_{\hat{\beta}_j} = 2 \cdot F_{t_{n-k-1}}(-|t_{\hat{\beta}_j}|), \tag{4.7}$$

where $F_{t_{n-k-1}}(\cdot)$ is the CDF of the t distribution with $n - k - 1$ degrees of freedom. If our software
provides us with the relevant p values, they are easy to use: We

reject $H_0$ if $p_{\hat{\beta}_j} < \alpha$.
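The decision rules based on the critical value and on the p value can be put side by side in a few lines. The following is a minimal sketch using scipy.stats; the function t_test and the numbers plugged in are hypothetical and do not come from any of the examples.

```python
from scipy import stats


def t_test(b_j, se_j, df, a_j=0.0, alpha=0.05):
    """Two-sided t test of H0: beta_j = a_j, following Equations 4.4 to 4.7."""
    t = (b_j - a_j) / se_j                    # test statistic
    c = stats.t.ppf(1 - alpha / 2, df)        # critical value: 1 - alpha/2 quantile
    p = 2 * stats.t.cdf(-abs(t), df)          # two-sided p value
    return t, c, p


# hypothetical estimate, standard error and degrees of freedom
t, c, p = t_test(b_j=0.5, se_j=0.2, df=100)
reject_by_cv = abs(t) > c
reject_by_p = p < 0.05
print(f't: {t}, c: {c}, p: {p}')
print(f'reject H0 (critical value): {reject_by_cv}, reject H0 (p value): {reject_by_p}')
```

Both rules always lead to the same decision, since $|t| > c$ holds exactly when $p < \alpha$.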


Since this standard case of a t test is so common, statsmodels provides us with the relevant
t and p values directly in the summary of the estimation results we already saw in the previous
chapter. The regression table includes for all regressors and the intercept:
* parameter estimates and standard errors, see Section 3.1.
* test statistics $t_{\hat{\beta}_j}$ from Equation 4.6 in the column t
* respective p values $p_{\hat{\beta}_j}$ from Equation 4.7 in the column P>|t|
* respective 95% confidence interval from Equation 4.8 in columns [0.025 and 0.975] (see
Section 4.2)


Wooldridge, Example 4.3: Determinants of College GPA 


We have repeatedly used the data set GPA1 in Chapter 3. This example uses three regressors and
estimates a regression model of the form

$$\text{colGPA} = \beta_0 + \beta_1 \cdot \text{hsGPA} + \beta_2 \cdot \text{ACT} + \beta_3 \cdot \text{skipped} + u.$$

For the critical values of the t tests, using the normal approximation instead of the exact t distribution
with $n - k - 1 = 137$ d.f. doesn't make much of a difference:
Script 4.1: Example-4-3-cv.py
import scipy.stats as stats
import numpy as np

# CV for alpha=5% and 1% using the t distribution with 137 d.f.:
alpha = np.array([0.05, 0.01])
cv_t = stats.t.ppf(1 - alpha / 2, 137)
print(f'cv_t: {cv_t}\n')

# CV for alpha=5% and 1% using the normal approximation:
cv_n = stats.norm.ppf(1 - alpha / 2)
print(f'cv_n: {cv_n}\n')


Output of Script 4.1: Example-4-3-cv.py
cv_t: [1.97743121 2.61219198]


cv_n: [1.95996398 2.5758293 ]


Script 4.2 (Example-4-3.py) presents the standard summary which directly contains all the information
to test the hypotheses in Equation 4.5 for all parameters. The t statistics for all coefficients except $\beta_2$ are
larger in absolute value than the critical value $c = 2.61$ (or $c = 2.58$ using the normal approximation)
for $\alpha = 1\%$. So for these coefficients, we would reject $H_0$ for all usual significance levels. By
construction, we draw the same conclusions from the p values.

In order to confirm that statsmodels is exactly using the formulas of Wooldridge (2019), we next
reconstruct the t and p values manually. We extract the coefficients (params) and standard errors (bse) from
the regression results, and simply apply Equations 4.6 and 4.7.


Script 4.2: Example-4-3.py
import wooldridge as woo
import statsmodels.formula.api as smf
import scipy.stats as stats

gpa1 = woo.dataWoo('gpa1')

# store and display results:
reg = smf.ols(formula='colGPA ~ hsGPA + ACT + skipped', data=gpa1)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')

# manually confirm the formulas, i.e. extract coefficients and SE:
b = results.params
se = results.bse

# reproduce t statistic:
tstat = b / se
print(f'tstat: \n{tstat}\n')

# reproduce p value (137 = n - k - 1 degrees of freedom):
pval = 2 * stats.t.cdf(-abs(tstat), 137)
print(f'pval: \n{pval}\n')
Output of Script 4.2: Example-4-3.py
results.summary():
OLS Regression Results
Dep. Variable: colGPA R-squared: 0.234
Model: OLS Adj. R-squared: 0.217
Method: Least Squares F-statistic: 13.92
Date: Tue, 12 May 2020 Prob (F-statistic): 5.65e-08
Time: 10:37:33 Log-Likelihood: -41.501
No. Observations: 141 AIC: 91.00
Df Residuals: 137 BIC: 102.8
Df Model: 3
Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]
Intercept 1.3896 0.332 4.191 0.000 0.734 2.045
hsGPA 0.4118 0.094 4.396 0.000 0.227 0.597
ACT 0.0147 0.011 1.393 0.166 -0.006 0.036
skipped -0.0831 0.026 -3.197 0.002 -0.135 -0.032

Omnibus: Durbin-Watson:
Prob(Omnibus): Jarque-Bera (JB):
Skew: Prob(JB):
Kurtosis: Cond. No.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly


tstat:
Intercept 4.191039
hsGPA 4.396260
ACT 1.393319
skipped -3.196840
dtype: float64


pval:
[4.95026897e-05 2.19205015e-05 1.65779902e-01 1.72543113e-03]


4.1.3. Other Hypotheses 


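For a one-tailed test, the critical value $c$ of the t test and the p values have to be adjusted appropriately.
Wooldridge (2019) provides a general discussion in Section 4.2. For testing the null hypothesis
$H_0: \beta_j = a_j$, the tests for the three common alternative hypotheses are summarized in Table 4.1:


Table 4.1. One- and Two-tailed t Tests for $H_0: \beta_j = a_j$

$H_1$:              $\beta_j \neq a_j$                      $\beta_j > a_j$                   $\beta_j < a_j$
c = quantile        $1 - \frac{\alpha}{2}$                  $1 - \alpha$                      $1 - \alpha$
reject $H_0$ if     $|t_{\hat{\beta}_j}| > c$               $t_{\hat{\beta}_j} > c$           $t_{\hat{\beta}_j} < -c$
p value             $2 \cdot F_{t_{n-k-1}}(-|t_{\hat{\beta}_j}|)$   $F_{t_{n-k-1}}(-t_{\hat{\beta}_j})$   $F_{t_{n-k-1}}(t_{\hat{\beta}_j})$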
Given the standard regression output like the one in Script 4.2 (Example-4-3.py) including the
p value for two-sided tests $p_{\hat{\beta}_j}$, we can easily do one-sided t tests for the null hypothesis $H_0: \beta_j = 0$
in two steps:
* Is $\hat{\beta}_j$ positive (if $H_1: \beta_j > 0$) or negative (if $H_1: \beta_j < 0$)?
  - No → Do not reject $H_0$ since this cannot be evidence against $H_0$.
  - Yes → The relevant p value is half of the reported $p_{\hat{\beta}_j}$.
    ⇒ Reject $H_0$ if $p = \frac{1}{2} p_{\hat{\beta}_j} < \alpha$.
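The two steps can be written as a small helper. This is only a sketch: the function one_sided_p and the numbers used are hypothetical. If the estimate has the "wrong" sign, the function returns $1 - p/2$, which is the exact one-sided p value in that case and always exceeds 0.5, so $H_0$ is never rejected.

```python
def one_sided_p(b_j, p_two_sided, alternative):
    """One-sided p value for H0: beta_j = 0 from the reported two-sided p value.

    alternative: 'greater' for H1: beta_j > 0, 'less' for H1: beta_j < 0
    """
    if alternative not in ('greater', 'less'):
        raise ValueError("alternative must be 'greater' or 'less'")
    sign_supports_h1 = b_j > 0 if alternative == 'greater' else b_j < 0
    if sign_supports_h1:
        return p_two_sided / 2        # step "Yes": half of the reported p value
    return 1 - p_two_sided / 2        # step "No": never below 0.5, so no rejection


# hypothetical estimates with a reported two-sided p value of 0.04
p_right_sign = one_sided_p(b_j=0.3, p_two_sided=0.04, alternative='greater')
p_wrong_sign = one_sided_p(b_j=-0.3, p_two_sided=0.04, alternative='greater')
print(f'p_right_sign: {p_right_sign}, p_wrong_sign: {p_wrong_sign}')
```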


Wooldridge, Example 4.1: Hourly Wage Equation 


We have already estimated the wage equation

$$\log(\text{wage}) = \beta_0 + \beta_1 \cdot \text{educ} + \beta_2 \cdot \text{exper} + \beta_3 \cdot \text{tenure} + u$$

in Example 3.2. Now we are ready to test $H_0: \beta_2 = 0$ against $H_1: \beta_2 > 0$. For the critical values of the t
tests, using the normal approximation instead of the exact t distribution with $n - k - 1 = 522$ d.f. doesn't
make any relevant difference:

Script 4.3: Example-4-1-cv.py
import scipy.stats as stats
import numpy as np

# CV for alpha=5% and 1% using the t distribution with 522 d.f.:
alpha = np.array([0.05, 0.01])
cv_t = stats.t.ppf(1 - alpha, 522)
print(f'cv_t: {cv_t}\n')

# CV for alpha=5% and 1% using the normal approximation:
cv_n = stats.norm.ppf(1 - alpha)
print(f'cv_n: {cv_n}\n')


Output of Script 4.3: Example-4-1-cv.py
cv_t: [1.64777794 2.33351273]


cv_n: [1.64485363 2.32634787]


Script 4.4 (Example-4-1.py) shows the standard regression output. The reported t statistic for the
parameter of exper is $t_{\hat{\beta}_2} = 2.391$ which is larger than the critical value $c = 2.33$ for the significance level
$\alpha = 1\%$, so we reject $H_0$. By construction, we get the same answer from looking at the p value. Like
always, the reported $p_{\hat{\beta}_j}$ value is for a two-sided test, so we have to divide it by 2. The resulting value
$p = \frac{0.017}{2} = 0.0085 < 0.01$, so we reject $H_0$ using an $\alpha = 1\%$ significance level.
Script 4.4: Example-4-1.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

wage1 = woo.dataWoo('wage1')

reg = smf.ols(formula='np.log(wage) ~ educ + exper + tenure', data=wage1)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')


Output of Script 4.4: Example-4-1.py
results.summary():
OLS Regression Results
Dep. Variable: np.log(wage) R-squared: 0.316
Model: OLS Adj. R-squared: 0.312
Method: Least Squares F-statistic: 80.39
Date: Tue, 12 May 2020 Prob (F-statistic): 9.13e-43
Time: 10:37:35 Log-Likelihood: -313.55
No. Observations: 526 AIC: 635.1
Df Residuals: 522 BIC: 652.2
Df Model: 3
Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]
Intercept 0.2844 .729 0.007 0.080 0.489
educ 0.0920 .555 0.000 0.078 0.106
exper 0.0041 2.391 0.017 0.001 0.008
tenure 0.0221 .133 0.000 0.016 0.028

Omnibus: Durbin-Watson: 1.769
Prob(Omnibus): Jarque-Bera (JB): 20.941
Skew: Prob(JB): 2.84e-05
Kurtosis: Cond. No. 135


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
4.2. Confidence Intervals


We have already looked at confidence intervals (CI) for the mean of a normally distributed random
variable in Sections 1.7 and 1.9.3. CI for the regression parameters are equally easy to construct and
closely related to t tests. Wooldridge (2019, Section 4.3) provides a succinct discussion. The 95%
confidence interval for parameter $\beta_j$ is simply

$$\hat{\beta}_j \pm c \cdot se(\hat{\beta}_j), \tag{4.8}$$

where $c$ is the same critical value as for the two-sided t test using a significance level $\alpha = 5\%$.
Wooldridge (2019) shows examples of how to manually construct these CI.

statsmodels provides the 95% confidence intervals for all parameters in the regression table.
If you use the method conf_int on the object with the regression results, you can compute CI for other
significance levels. Script 4.5 (Example-4-8.py) demonstrates the procedure.
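Equation 4.8 is easy to evaluate by hand once the estimate, its standard error and the degrees of freedom are known. The following sketch uses scipy.stats for the critical value; the function conf_int_manual and the numbers are hypothetical and only illustrate how the interval widens when the confidence level increases.

```python
from scipy import stats


def conf_int_manual(b_j, se_j, df, alpha=0.05):
    """(1 - alpha) confidence interval b_j +/- c * se_j as in Equation 4.8."""
    c = stats.t.ppf(1 - alpha / 2, df)   # same critical value as the two-sided t test
    return b_j - c * se_j, b_j + c * se_j


# hypothetical estimate and standard error with 29 degrees of freedom
lo95, hi95 = conf_int_manual(1.0, 0.25, df=29)
lo99, hi99 = conf_int_manual(1.0, 0.25, df=29, alpha=0.01)
print(f'95% CI: [{lo95}, {hi95}]')
print(f'99% CI: [{lo99}, {hi99}]')
```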


Wooldridge, Example 4.8: Model of R&D Expenditures 


We study the relationship between the R&D expenditures of a firm, its size, and the profit margin for a
sample of 32 firms in the chemical industry. The regression equation is

$$\log(\text{rd}) = \beta_0 + \beta_1 \cdot \log(\text{sales}) + \beta_2 \cdot \text{profmarg} + u.$$

Script 4.5 (Example-4-8.py) presents the regression results as well as the 95% and 99% CI. See
Wooldridge (2019) for the manual calculation of the CI and comments on the results.


Script 4.5: Example-4-8.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

rdchem = woo.dataWoo('rdchem')

# OLS regression:
reg = smf.ols(formula='np.log(rd) ~ np.log(sales) + profmarg', data=rdchem)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')

# 95% CI:
CI95 = results.conf_int(0.05)
print(f'CI95: \n{CI95}\n')

# 99% CI:
CI99 = results.conf_int(0.01)
print(f'CI99: \n{CI99}\n')
Output of Script 4.5: Example-4-8.py
results.summary():
OLS Regression Results
Dep. Variable: np.log(rd) R-squared: 0.918
Model: OLS Adj. R-squared: 0.912
Method: Least Squares F-statistic: 162.2
Date: Tue, 12 May 2020 Prob (F-statistic): 1.79e-16
Time: 10:37:37 Log-Likelihood: -22.511
No. Observations: 32 AIC: 51.02
Df Residuals: 29 BIC: 55.42
Df Model: 2
Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]
Intercept -4.3783 0.468 -9.355 0.000 -5.335 -3.421
np.log(sales) 1.0842 0.060 18.012 0.000 0.961 1.207
profmarg 0.0217 0.013 1.694 0.101 -0.004 0.048

Omnibus: 0.670 Durbin-Watson: 1.859
Prob(Omnibus): 0.715 Jarque-Bera (JB): 0.671
Skew: 0.308 Prob(JB): 0.715
Kurtosis: 2.649 Cond. No. 70.6


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly


CI95:
                      0         1
Intercept     -5.335478 -3.421068
np.log(sales)  0.961107  1.207332
profmarg      -0.004488  0.047799

CI99:
                      0         1
Intercept     -5.668313 -3.088234
np.log(sales)  0.918299  1.250141
profmarg      -0.013578  0.056890
4.3. Linear Restrictions: F Tests


Wooldridge (2019, Sections 4.4 and 4.5) discusses more general tests than those for the null hypothe- 
ses in Equation 4.1. They can involve one or more hypotheses involving one or more population 
parameters in a linear fashion. 

We follow the illustrative example of Wooldridge (2019, Section 4.5) and analyze major league 
baseball players’ salaries using the data set MLB1 and the regression model 


$$\log(\text{salary}) = \beta_0 + \beta_1 \cdot \text{years} + \beta_2 \cdot \text{gamesyr} + \beta_3 \cdot \text{bavg} + \beta_4 \cdot \text{hrunsyr} + \beta_5 \cdot \text{rbisyr} + u. \tag{4.9}$$

We want to test whether the performance measures batting average (bavg), home runs per year
(hrunsyr), and runs batted in per year (rbisyr) have an impact on the salary once we control
for the number of years as an active player (years) and the number of games played per year
(gamesyr). So we state our null hypothesis as $H_0: \beta_3 = 0, \beta_4 = 0, \beta_5 = 0$ versus $H_1$: $H_0$ is false, i.e.
at least one of the performance measures matters.

The test statistic of the F test is based on the relative difference between the sum of squared
residuals in the general (unrestricted) model and a restricted model in which the hypotheses are
imposed, $SSR_{ur}$ and $SSR_r$, respectively. In our example, the restricted model is one in which bavg,
hrunsyr, and rbisyr are excluded as regressors. If both models involve the same dependent
variable, it can also be written in terms of the coefficient of determination in the unrestricted and the
restricted model, $R_{ur}^2$ and $R_r^2$, respectively:

$$F = \frac{SSR_r - SSR_{ur}}{SSR_{ur}} \cdot \frac{n-k-1}{q} = \frac{R_{ur}^2 - R_r^2}{1 - R_{ur}^2} \cdot \frac{n-k-1}{q}, \tag{4.10}$$

where $q$ is the number of restrictions (in our example, $q = 3$). Intuitively, if the null hypothesis is
correct, then imposing it as a restriction will not lead to a significant drop in the model fit and the
F test statistic should be relatively small. It can be shown that under the CLM assumptions and the
null hypothesis, the statistic has an F distribution with the numerator degrees of freedom equal to $q$
and the denominator degrees of freedom of $n - k - 1$. Given a significance level $\alpha$, we will reject $H_0$
if $F > c$, where the critical value $c$ is the $1 - \alpha$ quantile of the relevant $F_{q, n-k-1}$ distribution. In our
example, $n = 353$, $k = 5$, $q = 3$. So with $\alpha = 1\%$, the critical value is 3.84 and can be calculated using
the f.ppf function in scipy.stats as


f.ppf(1 - 0.01, 3, 347)


Script 4.6 (F-Test.py) shows the calculations for this example. The result is $F = 9.55 > 3.84$, so
we clearly reject $H_0$. We also calculate the p value for this test. It is $p = 4.47 \cdot 10^{-6} = 0.00000447$, so
we reject $H_0$ for any reasonable significance level.
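The two versions of Equation 4.10 can be wrapped into small functions. The sketch below plugs in the $R^2$ values reported in the output of Script 4.6; the function names are hypothetical, and the SSR values are constructed from an arbitrary total sum of squares of 100 only to illustrate that both versions give the same number when the dependent variable is the same.

```python
def f_stat_r2(r2_ur, r2_r, n, k, q):
    """F statistic from the R-squared of the unrestricted and restricted model."""
    return (r2_ur - r2_r) / (1 - r2_ur) * (n - k - 1) / q


def f_stat_ssr(ssr_ur, ssr_r, n, k, q):
    """F statistic from the sums of squared residuals."""
    return (ssr_r - ssr_ur) / ssr_ur * (n - k - 1) / q


# R-squared values as reported in the output of Script 4.6
r2_ur = 0.6278028485187442
r2_r = 0.5970716339066895
fstat = f_stat_r2(r2_ur, r2_r, n=353, k=5, q=3)

# SSR = (1 - R^2) * SST; SST = 100 is an arbitrary illustrative value
sst = 100.0
fstat_ssr = f_stat_ssr((1 - r2_ur) * sst, (1 - r2_r) * sst, n=353, k=5, q=3)
print(f'fstat (R2 version): {fstat}\nfstat (SSR version): {fstat_ssr}')
```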
Script 4.6: F-Test.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf
import scipy.stats as stats

mlb1 = woo.dataWoo('mlb1')
n = mlb1.shape[0]

# unrestricted OLS regression:
reg_ur = smf.ols(
    formula='np.log(salary) ~ years + gamesyr + bavg + hrunsyr + rbisyr',
    data=mlb1)
fit_ur = reg_ur.fit()
r2_ur = fit_ur.rsquared
print(f'r2_ur: {r2_ur}\n')

# restricted OLS regression:
reg_r = smf.ols(formula='np.log(salary) ~ years + gamesyr', data=mlb1)
fit_r = reg_r.fit()
r2_r = fit_r.rsquared
print(f'r2_r: {r2_r}\n')

# F statistic (n - 6 = n - k - 1 with k = 5, and q = 3 restrictions):
fstat = (r2_ur - r2_r) / (1 - r2_ur) * (n - 6) / 3
print(f'fstat: {fstat}\n')

# CV for alpha=1% using the F distribution with 3 and 347 d.f.:
cv = stats.f.ppf(1 - 0.01, 3, 347)
print(f'cv: {cv}\n')

# p value = 1-cdf of the appropriate F distribution:
fpval = 1 - stats.f.cdf(fstat, 3, 347)
print(f'fpval: {fpval}\n')


Output of Script 4.6: F-Test.py
r2_ur: 0.6278028485187442


r2_r: 0.5970716339066895

fstat: 9.550253521951914

cv: 3.838520048496057


fpval: 4.473708139829391e-06


It should not be surprising that there is a more convenient way to do this. The module
statsmodels provides a command f_test which is well suited for these kinds of tests. Given
the object with regression results, for example results, an F test is conducted with


hypotheses = ['var_name1 = 0', 'var_name2 = 0', ...]
ftest = results.f_test(hypotheses)


where hypotheses collects the null hypotheses to be tested. It is a list of length q where each
restriction is described as a text in which the variable name takes the place of its parameter. In our
example, $H_0$ is that the three parameters of bavg, hrunsyr, and rbisyr are all equal to zero,
which translates as hypotheses = ['bavg = 0', 'hrunsyr = 0', 'rbisyr = 0']. Script
4.7 (F-Test-Automatic.py) implements this for the same test as the manual calculations done in
Script 4.6 (F-Test.py) and results in exactly the same F statistic and p value.


Script 4.7: F-Test-Automatic.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

mlb1 = woo.dataWoo('mlb1')

# OLS regression:
reg = smf.ols(
    formula='np.log(salary) ~ years + gamesyr + bavg + hrunsyr + rbisyr',
    data=mlb1)
results = reg.fit()

# automated F test:
hypotheses = ['bavg = 0', 'hrunsyr = 0', 'rbisyr = 0']
ftest = results.f_test(hypotheses)
fstat = ftest.statistic[0][0]
fpval = ftest.pvalue

print(f'fstat: {fstat}\n')
print(f'fpval: {fpval}\n')


Output of Script 4.7: F-Test-Automatic.py
fstat: 9.550253521951783

fpval: 4.473708139839581e-06


This function can also be used to test more complicated null hypotheses. For example, suppose
a sports reporter claims that the batting average plays no role and that the number of home
runs has twice the impact of the number of runs batted in. This translates (using variable names
instead of numbers as subscripts) as $H_0: \beta_{bavg} = 0,\ \beta_{hrunsyr} = 2 \cdot \beta_{rbisyr}$. For Python, we translate
it as hypotheses = ['bavg = 0', 'hrunsyr = 2*rbisyr']. The output of Script 4.8
(F-Test-Automatic2.py) shows the results of this test. The p value is p = 0.6, so we cannot reject
$H_0$.


Script 4.8: F-Test-Automatic2.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

mlb1 = woo.dataWoo('mlb1')

# OLS regression:
reg = smf.ols(
    formula='np.log(salary) ~ years + gamesyr + bavg + hrunsyr + rbisyr',
    data=mlb1)
results = reg.fit()

# automated F test:
hypotheses = ['bavg = 0', 'hrunsyr = 2*rbisyr']
ftest = results.f_test(hypotheses)
fstat = ftest.statistic[0][0]
fpval = ftest.pvalue

print(f'fstat: {fstat}\n')
print(f'fpval: {fpval}\n')


Output of Script 4.8: F-Test-Automatic2.py
fstat: 0.5117822576247235

fpval: 0.5998780329146685
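The text restrictions are a shorthand for a linear system R·β = 0. The following sketch, which uses synthetic data and made-up variable names (x1, x2, x3) rather than the mlb1 data set, illustrates that passing the restriction matrix to f_test directly should give the same statistic. The conversion with np.asarray(...).item() is an assumption made here so that the code does not depend on whether the installed statsmodels version returns the statistic as a scalar or as a 1x1 array.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic data (an assumption for illustration, not the mlb1 data set)
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({'x1': rng.normal(size=n),
                   'x2': rng.normal(size=n),
                   'x3': rng.normal(size=n)})
# in this population, the null hypothesis x1 = 0 and x2 = 2*x3 holds
df['y'] = 1 + 0.0 * df['x1'] + 1.0 * df['x2'] + 0.5 * df['x3'] + rng.normal(size=n)

results = smf.ols(formula='y ~ x1 + x2 + x3', data=df).fit()

# restrictions as text, as in the scripts of this section:
ftest_str = results.f_test(['x1 = 0', 'x2 = 2*x3'])

# the same restrictions as a matrix R with R @ beta = 0;
# columns follow results.params: Intercept, x1, x2, x3
R = np.array([[0, 1, 0, 0],
              [0, 0, 1, -2]])
ftest_mat = results.f_test(R)

# works whether the statistic is stored as a scalar or as a 1x1 array
fstat_str = np.asarray(ftest_str.statistic).item()
fstat_mat = np.asarray(ftest_mat.statistic).item()
print(f'fstat (text): {fstat_str}\nfstat (matrix): {fstat_mat}')
```

The matrix form can be handy when the restrictions are generated programmatically, while the text form is easier to read.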


The most important and at the same time most straightforward F test is the one for overall significance.
The null hypothesis is that all parameters except for the constant are equal to zero. If this null
hypothesis holds, the regressors do not have any joint explanatory power for y. The results of such
a test are automatically included in the upper part of the summary output as F-statistic (F
statistic) and Prob (F-statistic) (p value). As an example, see Script 4.5 (Example-4-8.py).
The null hypothesis that neither the sales nor the margin have any relation to R&D spending is
clearly rejected with an F statistic of 162.2 and a p value smaller than $10^{-15}$.
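Besides reading the values off the summary, the overall F statistic and its p value can be accessed as the attributes fvalue and f_pvalue of the results object. The sketch below uses synthetic data (an assumption for illustration) and compares fvalue with the R-squared form of the statistic, where the restricted model contains only the constant and therefore has an R-squared of zero.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic data (illustrative assumption)
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})
df['y'] = 1 + 0.5 * df['x1'] - 0.3 * df['x2'] + rng.normal(size=n)

results = smf.ols(formula='y ~ x1 + x2', data=df).fit()

# overall F statistic and p value as reported in the summary:
fstat = results.fvalue
fpval = results.f_pvalue

# the same statistic from R-squared (restricted model: constant only):
k = 2
fstat_manual = results.rsquared / (1 - results.rsquared) * (n - k - 1) / k

print(f'fstat: {fstat}\n')
print(f'fstat_manual: {fstat_manual}\n')
print(f'fpval: {fpval}\n')
```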


5. Multiple Regression Analysis: OLS 
Asymptotics 


Asymptotic theory allows us to relax some assumptions needed to derive the sampling distribution 
of estimators if the sample size is large enough. For running a regression in a software package, it 
does not matter whether we rely on stronger assumptions or on asymptotic arguments. So we don’t 
have to learn anything new regarding the implementation. 

Instead, this chapter aims to improve on our intuition regarding the workings of asymptotics by 
looking at some simulation exercises in Section 5.1. Section 5.2 briefly discusses the implementation 
of the regression-based Lagrange multiplier (LM) test presented by Wooldridge (2019, Section 5.2). 


5.1. Simulation Exercises 


In Section 2.7, we already used Monte Carlo Simulation methods to study the mean and variance 
of OLS estimators under the assumptions SLR.1-SLR.5. Here, we will conduct similar experiments 
but will look at the whole sampling distribution of OLS estimators similar to Section 1.9.2 where 
we demonstrated the central limit theorem for the sample mean. Remember that the sampling 
distribution is important since confidence intervals, t and F tests and other tools of inference rely on 
it. 

Theorem 4.1 of Wooldridge (2019) gives the normal distribution of the OLS estimators (conditional
on the regressors) based on assumptions MLR.1 through MLR.6. In contrast, Theorem 5.2 states that
asymptotically, the distribution is normal under assumptions MLR.1 through MLR.5 only. Assumption
MLR.6 (the normal distribution of the error terms) is not required if the sample is large enough to
justify asymptotic arguments.

In other words: In small samples, the parameter estimates have a normal sampling distribution 
only if 

* the error terms are normally distributed and 
* we condition on the regressors. 

To see how this works out in practice, we set up a series of simulation experiments. Section 5.1.1
simulates a model consistent with MLR.1 through MLR.6 and keeps the regressors fixed. Theory
suggests that the sampling distribution of $\hat\beta_1$ is normal, independent of the sample size. Section
5.1.2 simulates a violation of assumption MLR.6. Normality of $\hat\beta_1$ only holds asymptotically, so for
small sample sizes we suspect a violation. Finally, we will look closer into what “conditional on the
regressors” means and simulate a (very plausible) violation of this in Section 5.1.3.


5.1.1. Normally Distributed Error Terms 
Script 5.1 (Sim-Asy-OLS-norm.py) draws 10000 samples of a given size (which has to be stored
in variable n before) from a population that is consistent with assumptions MLR.1 through MLR.6.
The error terms are specified to be standard normal. The slope estimate $\hat\beta_1$ is stored for each of the
generated samples in the array b1. For a more detailed discussion of the implementation, see Section
2.7.2 where a very similar simulation exercise is introduced.


Script 5.1: Sim-Asy-OLS-norm.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import scipy.stats as stats

# set the random seed:
np.random.seed(1234567)

# set sample size and number of simulations:
n = 100
r = 10000

# set true parameters:
beta0 = 1
beta1 = 0.5
sx = 1
ex = 4

# initialize b1 to store results later:
b1 = np.empty(r)

# draw a sample of x, fixed over replications:
x = stats.norm.rvs(ex, sx, size=n)

# repeat r times:
for i in range(r):
    # draw a sample of u (std. normal):
    u = stats.norm.rvs(0, 1, size=n)
    y = beta0 + beta1 * x + u
    df = pd.DataFrame({'y': y, 'x': x})

    # estimate conditional OLS:
    reg = smf.ols(formula='y ~ x', data=df)
    results = reg.fit()
    b1[i] = results.params['x']


This code was run for different sample sizes. The density estimate together with the corresponding
normal density is shown in Figure 5.1. Not surprisingly, all distributions look very similar to the
normal distribution, which is what Theorem 4.1 predicted. Note that the fact that the sampling
variance decreases as n rises is only obvious if we pay attention to the different scales of the axes.
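The same experiment can be sketched without an explicit loop, because the slope estimate of a simple regression has a closed form. The code below is a vectorized variant written for illustration, not one of the book's scripts. It uses numpy's default_rng generator, so the draws differ from those of Script 5.1. It compares the simulated mean and variance of the slope estimates with the values implied by theory conditional on x, namely the true slope and 1/SST_x for standard normal errors.

```python
import numpy as np

rng = np.random.default_rng(1234567)
n, r = 100, 10000
beta0, beta1 = 1, 0.5

# regressor values, fixed over replications (mean 4, standard deviation 1):
x = rng.normal(4, 1, size=n)
x_dev = x - x.mean()
sst_x = np.sum(x_dev ** 2)

# r samples of standard normal errors, one row per replication:
u = rng.normal(0, 1, size=(r, n))
y = beta0 + beta1 * x + u

# closed-form simple regression slope for every replication at once:
b1 = (y @ x_dev) / sst_x

# compare simulated moments with the theoretical values conditional on x:
print(f'mean of b1: {b1.mean()} (theory: {beta1})')
print(f'variance of b1: {b1.var(ddof=1)} (theory: {1 / sst_x})')
```

Changing n and rerunning shows how the sampling variance shrinks with the sample size.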


5.1.2. Non-Normal Error Terms 


The next step is to simulate a violation of assumption MLR.6. In order to implement a rather drastic
violation of the normality assumption similar to Section 1.9.2, we implement a “standardized” $\chi^2$
distribution with one degree of freedom. More specifically, let v be distributed as $\chi^2_{[1]}$. Because
this distribution has a mean of 1 and a variance of 2, the error term $u = \frac{v - 1}{\sqrt{2}}$ has a mean of 0
and a variance of 1. This simplifies the comparison to the exercise with the standard normal errors
above. Figure 5.2 plots the density functions of the standard normal distribution used above and the
“standardized” $\chi^2$ distribution. Both have a mean of 0 and a variance of 1 but very different shapes.

Figure 5.1. Density of $\hat\beta_1$ with Different Sample Sizes: Normal Error Terms

Figure 5.2. Density Functions of the Simulated Error Terms

Script 5.2 (Sim-Asy-OLS-chisq.py) implements a simulation of this model and is listed in
the appendix (p. 350). The only line of code we changed compared to the previous Script 5.1
(Sim-Asy-OLS-norm.py) is the sampling of u where we replace drawing from a standard normal
distribution using u = stats.norm.rvs(0, 1, size=n) with sampling from the standardized
$\chi^2_{[1]}$ distribution with


u = (stats.chi2.rvs(1, size=n) - 1) / np.sqrt(2)
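The standardization can be checked numerically. The following sketch first asks scipy for the theoretical mean and variance of the chi-squared distribution with one degree of freedom and then inspects a large simulated sample of u. The sample size of one million is an arbitrary choice made here for illustration.

```python
import numpy as np
import scipy.stats as stats

# theoretical mean and variance of the chi-squared distribution with 1 d.f.:
mean_v, var_v = stats.chi2.stats(1, moments='mv')
print(f'mean of v: {mean_v}, variance of v: {var_v}')

# large simulated sample of the standardized error term:
np.random.seed(1234567)
u = (stats.chi2.rvs(1, size=1_000_000) - 1) / np.sqrt(2)

print(f'mean of u: {np.mean(u)}')
print(f'variance of u: {np.var(u, ddof=1)}')
# the skewness is far from 0, the value for a normal distribution:
print(f'skewness of u: {stats.skew(u)}')
```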


For each of the same sample sizes used above, we again estimate the slope parameter for 10000
samples. The densities of $\hat\beta_1$ are plotted in Figure 5.3 together with the respective normal
distributions with the corresponding variances. For the small sample sizes, the deviation from the normal
distribution is strong. Note that the dashed normal distributions have the same mean and variance.
The main difference is the kurtosis which is larger than 8 in the simulations for n = 5 compared to
the normal distribution for which the kurtosis is equal to 3.

For larger sample sizes, the sampling distribution of $\hat\beta_1$ converges to the normal distribution. For
n = 100, the difference is much smaller but still discernible. For n = 1000, it cannot be detected
anymore in our simulation exercise. How large the sample needs to be depends among other things
on the severity of the violations of MLR.6. If the distribution of the error terms is not as extremely
non-normal as in our simulations, smaller sample sizes like the rule of thumb n = 30 might suffice
for valid asymptotics.

Figure 5.3. Density of $\hat\beta_1$ with Different Sample Sizes: Non-Normal Error Terms


5.1.3. (Not) Conditioning on the Regressors


There is a more subtle difference between the finite-sample results regarding the variance (Theorem
3.2) and distribution (Theorem 4.1) on the one hand and the corresponding asymptotic results (Theorem
5.2) on the other. The former results describe the sampling distribution “conditional on the sample values of the
independent variables”. This implies that as we draw different samples, the values of the regressors
$x_1, \dots, x_k$ remain the same and only the error terms and dependent variables change.

In our previous simulation exercises in Scripts like 2.16 (SLR-Sim-Model-Condx.py), 5.1 
(Sim-Asy-OLS-norm.py), and 5.2 (Sim-Asy-OLS-chisq.py), this is implemented by making 
random draws of x outside of the simulation loop. This is a realistic description of how data is gen- 
erated only in some simple experiments: The experimenter chooses the regressors for the sample, 
conducts the experiment and measures the dependent variable. 

In most applications we are concerned with, this is an unrealistic description of how we obtain 
our data. If we draw a sample of individuals, both their dependent and independent variables differ 
across samples. In these cases, the distribution "conditional on the sample values of the independent 
variables" can only serve as an approximation of the actual distribution with varying regressors. For 
large samples, this distinction is irrelevant and the asymptotic distribution is the same. 

Let's see how this plays out in an example. Script 5.3 (Sim-Asy-OLS-uncond.py) differs from
Script 5.1 (Sim-Asy-OLS-norm.py) only by moving the generation of the regressors into the loop in
which the 10 000 samples are generated. This is inconsistent with Theorem 4.1, so for small samples,
we don't know the distribution of $\hat\beta_1$. Theorem 5.2 is applicable, so for (very) large samples, we
know that the estimator is normally distributed.
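The difference between the two designs can be sketched in a few lines by computing the closed-form slope estimate for many samples at once. This is a vectorized illustration written for this purpose, not one of the book's scripts. The helper slopes and the use of default_rng are assumptions, so the numbers will not coincide with those obtained from Script 5.3. Kurtosis is reported on the scale where the normal distribution has a value of 3.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1234567)
n, r = 5, 10000
beta0, beta1 = 1, 0.5


def slopes(x, u):
    """Closed-form OLS slope for each row (one row = one sample)."""
    x_dev = x - x.mean(axis=1, keepdims=True)
    y = beta0 + beta1 * x + u
    return np.sum(x_dev * y, axis=1) / np.sum(x_dev ** 2, axis=1)


u = rng.normal(0, 1, size=(r, n))

# regressors fixed over replications: the same row of x in every sample
x_fixed = np.tile(rng.normal(4, 1, size=n), (r, 1))
# regressors varying over replications: a new row of x for every sample
x_vary = rng.normal(4, 1, size=(r, n))

kurt_fixed = stats.kurtosis(slopes(x_fixed, u), fisher=False)
kurt_vary = stats.kurtosis(slopes(x_vary, u), fisher=False)
print(f'kurtosis, fixed x: {kurt_fixed}')
print(f'kurtosis, varying x: {kurt_vary}')
```

With fixed regressors the kurtosis should be close to 3 even for n = 5, whereas with varying regressors the tails are noticeably heavier.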

Figure 5.4 shows the distribution of the 10000 estimates generated by Script 5.3
(Sim-Asy-OLS-uncond.py) for n = 5, 10, 100, and 1000. As we expected from theory, the
distribution is (close to) normal for large samples. For small samples, it deviates quite a bit. The
kurtosis is 8.7 for a sample size of n = 5 which is far away from the kurtosis of 3 of a normal
distribution.

Script 5.3: Sim-Asy-OLS-uncond.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import scipy.stats as stats

# set the random seed:
np.random.seed(1234567)

# set sample size and number of simulations:
n = 100
r = 10000

# set true parameters:
beta0 = 1
beta1 = 0.5
sx = 1
ex = 4

# initialize b1 to store results later:
b1 = np.empty(r)

# repeat r times:
for i in range(r):
    # draw a sample of x, varying over replications:
    x = stats.norm.rvs(ex, sx, size=n)

    # draw a sample of u (std. normal):
    u = stats.norm.rvs(0, 1, size=n)
    y = beta0 + beta1 * x + u
    df = pd.DataFrame({'y': y, 'x': x})

    # estimate unconditional OLS:
    reg = smf.ols(formula='y ~ x', data=df)
    results = reg.fit()
    b1[i] = results.params['x']

Figure 5.4. Density of $\hat\beta_1$ with Different Sample Sizes: Varying Regressors


5.2. LM Test


As an alternative to the F tests discussed in Section 4.3, LM tests for the same sort of hypotheses can
be very useful with large samples. In the linear regression setup, the test statistic is


$$LM = n \cdot R^2_{\tilde{u}},$$


where n is the sample size and $R^2_{\tilde{u}}$ is the usual $R^2$ statistic in a regression of the residual $\tilde{u}$ from the
restricted model on the unrestricted set of regressors. Under the null hypothesis, it is asymptotically
distributed as $\chi^2_q$ with q denoting the number of restrictions. Details are given in Wooldridge (2019,
Section 5.2).

The implementation in statsmodels is straightforward if we remember that the residuals can be
obtained with the resid attribute.
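The steps can be wrapped in a small helper. The function lm_test below and its arguments are hypothetical names introduced for this sketch. It is applied to synthetic data in which the null hypothesis holds, and the result is compared with the p value of the corresponding F test.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import scipy.stats as stats


def lm_test(data, formula_r, regressors_ur, q):
    """LM test: n * R2 from regressing restricted residuals on all regressors."""
    fit_r = smf.ols(formula=formula_r, data=data).fit()
    aux = data.assign(utilde=fit_r.resid)
    fit_aux = smf.ols(formula=f'utilde ~ {regressors_ur}', data=aux).fit()
    lm = fit_aux.nobs * fit_aux.rsquared
    pval = 1 - stats.chi2.cdf(lm, q)
    return lm, pval


# synthetic data (illustrative assumption): x2 and x3 are irrelevant, H0 holds
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=['x1', 'x2', 'x3'])
df['y'] = 1 + 0.5 * df['x1'] + rng.normal(size=n)

lm, pval = lm_test(df, 'y ~ x1', 'x1 + x2 + x3', q=2)

# F test of the same hypothesis for comparison:
results = smf.ols(formula='y ~ x1 + x2 + x3', data=df).fit()
fpval = np.asarray(results.f_test(['x2 = 0', 'x3 = 0']).pvalue).item()

print(f'LM: {lm}\n')
print(f'pval: {pval}\n')
print(f'fpval: {fpval}\n')
```

In a sample of this size, the two p values should be very close to each other.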


Wooldridge, Example 5.3: Economic Model of Crime 


We analyze the same data on the number of arrests as in Example 3.5. The unrestricted regression
model equation is


$$\text{narr86} = \beta_0 + \beta_1 \text{pcnv} + \beta_2 \text{avgsen} + \beta_3 \text{tottime} + \beta_4 \text{ptime86} + \beta_5 \text{qemp86} + u.$$


The dependent variable narr86 reflects the number of times a man was arrested and is explained by
the proportion of prior arrests that led to a conviction (pcnv), previous average sentences (avgsen), the time spent in prison
before 1986 (tottime), the number of months in prison in 1986 (ptime86), and the number of quarters
employed in 1986 (qemp86).
The joint null hypothesis is

$$H_0: \beta_2 = \beta_3 = 0,$$

so the restricted set of regressors excludes avgsen and tottime. Script 5.4 (Example-5-3.py) shows
an implementation of this LM test. The restricted model is estimated and its residuals $\tilde{u}$ (stored as utilde) are
calculated. They are regressed on the unrestricted set of regressors. The $R^2$ from this regression is
0.001494, so the LM test statistic is calculated to be around $LM = 0.001494 \cdot 2725 = 4.071$. This is smaller
than the critical value of 4.605 for a significance level of $\alpha = 10\%$, so we do not reject the null hypothesis. We
can also easily calculate the p value using the $\chi^2$ CDF chi2.cdf. It turns out to be 0.1306.
The same hypothesis can be tested using the F test presented in Section 4.3 using the command
f_test. In this example, it delivers the same p value up to three digits.

Script 5.4: Example-5-3.py
import wooldridge as woo
import statsmodels.formula.api as smf
import scipy.stats as stats

crime1 = woo.dataWoo('crime1')

# 1. estimate restricted model:
reg_r = smf.ols(formula='narr86 ~ pcnv + ptime86 + qemp86', data=crime1)
fit_r = reg_r.fit()
r2_r = fit_r.rsquared
print(f'r2_r: {r2_r}\n')

# 2. regression of residuals from restricted model:
crime1['utilde'] = fit_r.resid
reg_LM = smf.ols(formula='utilde ~ pcnv + ptime86 + qemp86 + avgsen + tottime',
                 data=crime1)
fit_LM = reg_LM.fit()
r2_LM = fit_LM.rsquared
print(f'r2_LM: {r2_LM}\n')

# 3. calculation of LM test statistic:
LM = r2_LM * fit_LM.nobs
print(f'LM: {LM}\n')

# 4. critical value from chi-squared distribution, alpha=10%:
cv = stats.chi2.ppf(1 - 0.10, 2)
print(f'cv: {cv}\n')

# 5. p value (alternative to critical value):
pval = 1 - stats.chi2.cdf(LM, 2)
print(f'pval: {pval}\n')

# 6. compare to F-test:
reg = smf.ols(formula='narr86 ~ pcnv + ptime86 + qemp86 + avgsen + tottime',
              data=crime1)
results = reg.fit()
hypotheses = ['avgsen = 0', 'tottime = 0']
ftest = results.f_test(hypotheses)
fstat = ftest.statistic[0][0]
fpval = ftest.pvalue
print(f'fstat: {fstat}\n')
print(f'fpval: {fpval}\n')


Output of Script 5.4: Example-5-3.py
r2_r: 0.04132330770123016

r2_LM: 0.0014938456737880745

LM: 4.070729461072503

cv: 4.605170185988092

pval: 0.13063282803261256

fstat: 2.0339215584351407

fpval: 0.13102048172760739


6. Multiple Regression Analysis: Further 
Issues 


In this chapter, we cover some issues regarding the implementation of regression analyses. Section
6.1 discusses more flexible specification of regression equations such as variable scaling,
standardization, polynomials and interactions. They can be conveniently included in the formula and used
in the statsmodels OLS estimation. Section 6.2 is concerned with predictions and their confidence
and prediction intervals.


6.1. Model Formulae 


If we run a regression in statsmodels using a syntax like


smf.ols('y ~ x1 + x2 + x3', data=sample)


the expression y ~ x1 + x2 + x3 is referred to as a model formula. It is a compact symbolic
way to describe our regression equation. The dependent variable is separated from the regressors
by a “~” and the regressors are separated by a “+” indicating that they enter the equation in a linear
fashion. A constant is added by default. Such formulae can be specified in more complex ways
to indicate different kinds of regression equations. We will cover the most important ones in this
section.
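A quick way to see which terms a formula generates is to look at the labels of the estimated parameters. The sketch below uses synthetic data (an assumption for illustration) and contrasts the default formula with a variant in which the constant is suppressed by adding 0 to the formula.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic data (illustrative assumption)
rng = np.random.default_rng(0)
sample = pd.DataFrame(rng.normal(size=(50, 3)), columns=['x1', 'x2', 'x3'])
sample['y'] = 2 + sample['x1'] - sample['x2'] + rng.normal(size=50)

# default: the constant is added automatically and labeled 'Intercept'
with_const = smf.ols('y ~ x1 + x2 + x3', data=sample).fit()
# '0 +' suppresses the constant
no_const = smf.ols('y ~ 0 + x1 + x2 + x3', data=sample).fit()

print(list(with_const.params.index))
print(list(no_const.params.index))
```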


6.1.1. Data Scaling: Arithmetic Operations Within a Formula 


Wooldridge (2019) discusses how different scaling of the variables in the model affects the parameter
estimates and other statistics in Section 6.1. As an example, consider a model relating the birth weight to
cigarette smoking of the mother during pregnancy and the family income. The basic model equation
is

$$\text{bwght} = \beta_0 + \beta_1 \text{cigs} + \beta_2 \text{faminc} + u \qquad (6.1)$$


which translates into formula syntax as bwght ~ cigs + faminc.
If we want to measure the weight in pounds rather than ounces, there are two ways to implement
different rescaling in Python. We can
* Define a different variable like bwghtlbs = bwght/16 and use this variable in the formula:
bwghtlbs ~ cigs + faminc
* Specify this rescaling directly in the formula: I(bwght/16) ~ cigs + faminc
The latter approach can be more convenient. Note that the I(...) brackets describe any parts of
the formula in which we specify arithmetic transformations.
If we want to measure the number of cigarettes smoked per day in packs, we could again define a
new variable packs = cigs/20 and use it as a regressor or simply specify the formula bwght ~
I(cigs/20) + faminc. Here, the importance of using the I function is easy to see. If we specified
the formula bwght ~ I(cigs/20 + faminc) instead, we would have a (nonsense) model with
only one regressor: the sum of the packs smoked and the income.

Script 6.1 (Data-Scaling.py) demonstrates these features. As discussed in Wooldridge (2019,
Section 6.1), dividing the dependent variable by 16 changes all coefficients by the same factor 1/16
and dividing a regressor by 20 changes its coefficient by the factor 20. Other statistics like $R^2$ are
unaffected.

Script 6.1: Data-Scaling.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

bwght = woo.dataWoo('bwght')

# regress and report coefficients:
reg = smf.ols(formula='bwght ~ cigs + faminc', data=bwght)
results = reg.fit()

# weight in pounds, manual way:
bwght['bwght_lbs'] = bwght['bwght'] / 16
reg_lbs = smf.ols(formula='bwght_lbs ~ cigs + faminc', data=bwght)
results_lbs = reg_lbs.fit()

# weight in pounds, direct way:
reg_lbs2 = smf.ols(formula='I(bwght/16) ~ cigs + faminc', data=bwght)
results_lbs2 = reg_lbs2.fit()

# packs of cigarettes:
reg_packs = smf.ols(formula='bwght ~ I(cigs/20) + faminc', data=bwght)
results_packs = reg_packs.fit()

# compare results:
table = pd.DataFrame({'b': round(results.params, 4),
                      'b_lbs': round(results_lbs.params, 4),
                      'b_lbs2': round(results_lbs2.params, 4),
                      'b_packs': round(results_packs.params, 4)})
print(f'table: \n{table}\n')


Output of Script 6.1: Data-Scaling.py
table:
                     b   b_lbs  b_lbs2   b_packs
I(cigs / 20)       NaN     NaN     NaN   -9.2682
Intercept     116.9741  7.3109  7.3109  116.9741
cigs           -0.4634 -0.0290 -0.0290       NaN
faminc          0.0928  0.0058  0.0058    0.0928
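The scaling rules can also be verified numerically instead of by comparing table entries. The following sketch uses synthetic data with made-up variable names (y, x1, x2) rather than the bwght data set. The label of the rescaled regressor is looked up from the parameter index because its exact spelling is generated by the formula machinery.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic data (illustrative assumption, not the bwght data set)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({'x1': rng.normal(10, 2, size=n),
                   'x2': rng.normal(5, 1, size=n)})
df['y'] = 100 - 0.5 * df['x1'] + 0.1 * df['x2'] + rng.normal(size=n)

b = smf.ols('y ~ x1 + x2', data=df).fit()
b_y16 = smf.ols('I(y/16) ~ x1 + x2', data=df).fit()
b_x20 = smf.ols('y ~ I(x1/20) + x2', data=df).fit()

# dividing y by 16 divides every coefficient by 16:
ratio_y = b.params / b_y16.params
print(f'ratio_y: \n{ratio_y}\n')

# dividing x1 by 20 multiplies its coefficient by 20:
label = [name for name in b_x20.params.index if name.startswith('I(')][0]
ratio_x = b_x20.params[label] / b.params['x1']
print(f'ratio_x: {ratio_x}\n')

# R-squared is unaffected by either rescaling:
print(f'R2: {b.rsquared}, {b_y16.rsquared}, {b_x20.rsquared}\n')
```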


6.1.2. Standardization: Beta Coefficients 


A specific arithmetic operation is the standardization. A variable is standardized by subtracting its
mean and dividing by its standard deviation. For example, the standardized dependent variable y
and regressor $x_1$ are


$$z_y = \frac{y - \bar{y}}{\text{sd}(y)} \qquad \text{and} \qquad z_{x_1} = \frac{x_1 - \bar{x}_1}{\text{sd}(x_1)} \qquad (6.2)$$

If the regression model only contains standardized variables, the coefficients have a special
interpretation. They measure by how many standard deviations y changes as the respective independent
variable increases by one standard deviation. Inconsistent with the notation used here, they are
sometimes referred to as beta coefficients.

In Python, we can use the same type of arithmetic transformations as in Section 6.1.1 to subtract
the mean and divide by the standard deviation. It can be done more conveniently by defining and
using a function scale directly for all variables we want to standardize. Defining a function was
introduced in Section 1.8.3 and Script 6.2 (Example-6-1.py) demonstrates the use of scale in the
context of a regression.


Wooldridge, Example 6.1: Effects of Pollution on Housing Prices 


We are interested in how air pollution (nox) and other neighborhood characteristics affect the value of
a house. A model using standardization for all variables is expressed in a formula as


price_sc ~ 0 + nox_sc + crime_sc + rooms_sc + dist_sc + stratio_sc


with variable_sc denoting the scaled version of variable. The output of Script 6.2 (Example-6-1.py)
shows the parameter estimates of this model. The house price drops by 0.34 standard deviations as the
air pollution increases by one standard deviation.


Script 6.2: Example-6-1.py
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf


# define a function for the standardization:
def scale(x):
    x_mean = np.mean(x)
    x_var = np.var(x, ddof=1)
    x_scaled = (x - x_mean) / np.sqrt(x_var)
    return x_scaled


# standardize and estimate:
hprice2 = woo.dataWoo('hprice2')
hprice2['price_sc'] = scale(hprice2['price'])
hprice2['nox_sc'] = scale(hprice2['nox'])
hprice2['crime_sc'] = scale(hprice2['crime'])
hprice2['rooms_sc'] = scale(hprice2['rooms'])
hprice2['dist_sc'] = scale(hprice2['dist'])
hprice2['stratio_sc'] = scale(hprice2['stratio'])

reg = smf.ols(
    formula='price_sc ~ 0 + nox_sc + crime_sc + rooms_sc + dist_sc + stratio_sc',
    data=hprice2)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')

Output of Script 6.2: Example-6-1.py
table:
                 b      se        t  pval
nox_sc     -0.3404  0.0445  -7.6511   0.0
crime_sc   -0.1433  0.0307  -4.6693   0.0
rooms_sc    0.5139  0.0300  17.1295   0.0
dist_sc    -0.2348  0.0430  -5.4641   0.0
stratio_sc -0.2703  0.0299  -9.0274   0.0
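Beta coefficients are linked to the coefficients of the original regression: each one equals the original slope multiplied by sd(x)/sd(y). The sketch below illustrates this relation with synthetic data and made-up variable names (an assumption for illustration, not the hprice2 data set), using a compact version of the scale function.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def scale(x):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    return (x - np.mean(x)) / np.std(x, ddof=1)


# synthetic data (illustrative assumption, not the hprice2 data set)
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({'x1': rng.normal(5, 3, size=n),
                   'x2': rng.normal(0, 10, size=n)})
df['y'] = 20 - 2 * df['x1'] + 0.4 * df['x2'] + rng.normal(0, 5, size=n)

for name in ['y', 'x1', 'x2']:
    df[f'{name}_sc'] = scale(df[name])

raw = smf.ols('y ~ x1 + x2', data=df).fit()
std = smf.ols('y_sc ~ 0 + x1_sc + x2_sc', data=df).fit()

# beta coefficient = original slope times sd(x) / sd(y):
beta_x1 = raw.params['x1'] * df['x1'].std() / df['y'].std()
print(f"standardized regression: {std.params['x1_sc']}\n")
print(f'rescaled original slope: {beta_x1}\n')
```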


6.1.3. Logarithms 


We have already seen in Section 2.4 that we can include the numpy function log directly in formulas
to represent logarithmic and semi-logarithmic models. A simple example of a partially logarithmic
model and its formula would be


$$\log(y) = \beta_0 + \beta_1 \log(x_1) + \beta_2 x_2 + u \qquad (6.3)$$


which can be expressed as np.log(y) ~ np.log(x1) + x2.

Script 6.3 (Formula-Logarithm.py) shows this again for the house price example. As the air
pollution nox increases by one percent, the house price drops by about 0.72 percent. As the number
of rooms increases by one, the value of the house increases by roughly 30.6%. Wooldridge (2019,
Section 6.2) discusses how the latter value is only an approximation and the actual estimated effect
is $\exp(0.306) - 1 = 0.358$ which is 35.8%.
Script 6.3: Formula-Logarithm.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

hprice2 = woo.dataWoo('hprice2')

reg = smf.ols(formula='np.log(price) ~ np.log(nox) + rooms', data=hprice2)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Output of Script 6.3: Formula-Logarithm.py
table:
                  b      se        t  pval
Intercept    9.2337  0.1877  49.1835   0.0
np.log(nox) -0.7177  0.0663 -10.8182   0.0
rooms        0.3059  0.0190  16.0863   0.0
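The gap between the approximate and the exact effect of one additional room can be computed in two lines. The sketch uses the rounded coefficient of 0.306 quoted in the text rather than re-estimating the model.

```python
import numpy as np

# rounded coefficient of rooms as quoted in the text
b_rooms = 0.306

approx_effect = b_rooms              # reading the coefficient as a share
exact_effect = np.exp(b_rooms) - 1   # exact change implied by the log model

print(f'approximate effect: {100 * approx_effect:.1f}%')
print(f'exact effect: {100 * exact_effect:.1f}%')
```

The approximation works well for small coefficients and deteriorates as the coefficient grows in absolute value.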


6.1.4. Quadratics and Polynomials 


Specifying quadratic terms or higher powers of regressors can be a useful way to make a model more
flexible by allowing the partial effects or (semi-)elasticities to decrease or increase with the value of
the regressor.
Instead of creating additional variables containing the squared value of a regressor, in Python we
can simply add I(x**2) to a formula. Higher order terms are specified accordingly. A simple cubic
model and its corresponding formula are


$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + u \qquad (6.4)$$

which translates to y ~ x + I(x**2) + I(x**3) in formula syntax.


For nonlinear models like this, it is often useful to get a graphical illustration of the effects. Section
6.2.2 shows how to conveniently generate these.


Wooldridge, Example 6.2: Effects of Pollution on Housing Prices 


This example of Wooldridge (2019) demonstrates the combination of logarithmic and quadratic
specifications. The model for house prices is


$$\log(\text{price}) = \beta_0 + \beta_1 \log(\text{nox}) + \beta_2 \log(\text{dist}) + \beta_3 \text{rooms} + \beta_4 \text{rooms}^2 + \beta_5 \text{stratio} + u.$$


Script 6.4 (Example-6-2.py) implements this model and presents detailed results including t statistics
and their p values. The quadratic term of rooms has a significantly positive coefficient $\hat\beta_4$ implying
that the semi-elasticity increases with more rooms. The negative coefficient for rooms and the positive
coefficient for rooms² imply that for “small” numbers of rooms, the price decreases with the number
of rooms and for “large” values, it increases. The number of rooms implying the smallest price can be
found as¹


$$\text{rooms}^* = \frac{-\hat\beta_3}{2 \hat\beta_4} \approx 4.4.$$

Script 6.4: Example-6-2.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

hprice2 = woo.dataWoo('hprice2')

reg = smf.ols(
    formula='np.log(price) ~ np.log(nox)+np.log(dist)+rooms+I(rooms**2)+stratio',
    data=hprice2)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


¹ We need to find rooms* to minimize $\beta_3 \text{rooms} + \beta_4 \text{rooms}^2$. Setting the first derivative $\beta_3 + 2 \beta_4 \text{rooms}$ equal to zero and
solving for rooms delivers the result.

Output of Script 6.4: Example-6-2.py
table:
                     b      se       t    pval
Intercept      13.3855  0.5665 23.6295  0.0000
np.log(nox)    -0.9017  0.1147 -7.8621  0.0000
np.log(dist)   -0.0868  0.0433 -2.0051  0.0455
rooms          -0.5451  0.1655 -3.2946  0.0011
I(rooms ** 2)   0.0623  0.0128  4.8623  0.0000
stratio        -0.0476  0.0059 -8.1293  0.0000
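The turning point and the implied semi-elasticities follow directly from the two rooms coefficients. The sketch below plugs in the rounded estimates from the output table. The helper semi_elasticity is a name introduced here for illustration and uses the usual approximation of 100 times the derivative of log(price) with respect to rooms.

```python
# rounded coefficient estimates taken from the output of Script 6.4
b_rooms = -0.5451
b_rooms_sq = 0.0623

# first-order condition b_rooms + 2 * b_rooms_sq * rooms = 0:
rooms_star = -b_rooms / (2 * b_rooms_sq)
print(f'rooms*: {rooms_star:.2f}')


def semi_elasticity(rooms):
    """Approximate % change in price for one more room at a given room count."""
    return 100 * (b_rooms + 2 * b_rooms_sq * rooms)


for rooms in [3, 5, 7]:
    print(f'rooms = {rooms}: {semi_elasticity(rooms):.1f}%')
```

Below the turning point the estimated semi-elasticity is negative, above it positive, in line with the discussion of Example 6.2.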


6.1.5. Hypothesis Testing 


A natural question to ask is whether a regressor has additional statistically significant explanatory
power in a regression model, given all the other regressors. In simple model specifications, this
question can be answered by a simple t test, so the results for all regressors are available with a quick
look at the standard regression table.² When working with polynomials or other specifications, the
influence of one regressor is captured by several parameters. We can test its significance with an F
test of the joint null hypothesis that all of these parameters are equal to zero. As an example, let's
revisit Example 6.2:


$$\log(\text{price}) = \beta_0 + \beta_1 \log(\text{nox}) + \beta_2 \log(\text{dist}) + \beta_3 \text{rooms} + \beta_4 \text{rooms}^2 + \beta_5 \text{stratio} + u$$


The significance of rooms can be assessed with an F test of $H_0: \beta_3 = \beta_4 = 0$. As discussed in Section
4.3, such a test can be performed with the command f_test from the module statsmodels. This
is shown in Script 6.5 (Example-6-2-Ftest.py).


Script 6.5: Example-6-2-Ftest.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

hprice2 = woo.dataWoo('hprice2')
n = hprice2.shape[0]

reg = smf.ols(
    formula='np.log(price) ~ np.log(nox)+np.log(dist)+rooms+I(rooms**2)+stratio',
    data=hprice2)
results = reg.fit()

# implemented F test for rooms:
hypotheses = ['rooms = 0', 'I(rooms ** 2) = 0']
ftest = results.f_test(hypotheses)
fstat = ftest.statistic[0][0]
fpval = ftest.pvalue

print(f'fstat: {fstat}\n')
print(f'fpval: {fpval}\n')


Output of Script 6.5: Example-6-2-Ftest.py
fstat: 110.4187819267064

fpval: 1.9193250019375434e-40


² Section 4.1 discusses t tests.


6.1.6. Interaction Terms


Models with interaction terms allow the effect of one variable $x_1$ to depend on the value of another
variable $x_2$. A simple model including an interaction term would be


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + u. \qquad (6.5)$$


Of course, we can implement this in Python by defining a new variable containing the product of
the two regressors. But again, a direct specification in the model formula is more convenient. The
expression x1:x2 within a formula adds the interaction term $x_1 x_2$. Even more conveniently, x1*x2
adds not only the interaction but also both original variables allowing for a very concise syntax. So
the model in Equation 6.5 can be specified in Python as either of the two formulas:

y ~ x1 + x2 + x1:x2 ⇔ y ~ x1*x2

If one variable $x_1$ is interacted with a set of other variables, they can be grouped by parentheses to
allow for a compact syntax. For example, the shortest way to express the model equation


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + u \qquad (6.6)$$


in Python syntax is y ~ x1*(x2 + x3).
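The equivalence of the formula variants can be checked by comparing the terms they generate. The following sketch uses synthetic data (an assumption for illustration), fits the long and the short form and lists the parameter labels of all three formulas.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic data (illustrative assumption)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=['x1', 'x2', 'x3'])
df['y'] = (1 + df['x1'] + df['x2'] + 0.5 * df['x1'] * df['x2']
           + rng.normal(size=200))

long_form = smf.ols('y ~ x1 + x2 + x1:x2', data=df).fit()
short_form = smf.ols('y ~ x1*x2', data=df).fit()
grouped = smf.ols('y ~ x1*(x2 + x3)', data=df).fit()

print(sorted(long_form.params.index))
print(sorted(short_form.params.index))
print(sorted(grouped.params.index))
```

The first two lists should be identical, and the third should additionally contain x3 and the interaction x1:x3.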


Wooldridge, Example 6.3: Effects of Attendance on Final Exam Performance 


This example analyzes a model including a standardized dependent variable, quadratic terms and
an interaction. Standardized scores in the final exam are explained by class attendance, prior
performance and an interaction term:


$$\text{stndfnl} = \beta_0 + \beta_1 \text{atndrte} + \beta_2 \text{priGPA} + \beta_3 \text{ACT} + \beta_4 \text{priGPA}^2 + \beta_5 \text{ACT}^2 + \beta_6 \text{priGPA} \cdot \text{atndrte} + u$$


Script 6.6 (Example-6-3.py) estimates this model.
The effect of attending classes is

$$\frac{\partial \text{stndfnl}}{\partial \text{atndrte}} = \beta_1 + \beta_6 \text{priGPA}.$$

For the average priGPA = 2.59, the script estimates this partial effect to be around 0.0078. It tests the
null hypothesis that this effect is zero using a simple F test, see Section 4.3. With a p value of 0.0034, this
hypothesis can be rejected at all common significance levels.


Script 6.6: Example-6-3.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

attend = woo.dataWoo('attend')
n = attend.shape[0]

reg = smf.ols(formula='stndfnl ~ atndrte*priGPA + ACT + I(priGPA**2) + I(ACT**2)',
              data=attend)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')

# estimate for partial effect at priGPA=2.59:
b = results.params
partial_effect = b['atndrte'] + 2.59 * b['atndrte:priGPA']
print(f'partial_effect: {partial_effect}\n')

# F test for partial effect at priGPA=2.59:
hypotheses = 'atndrte + 2.59 * atndrte:priGPA = 0'
ftest = results.f_test(hypotheses)
fstat = ftest.statistic[0][0]
fpval = ftest.pvalue

print(f'fstat: {fstat}\n')
print(f'fpval: {fpval}\n')


Output of Script 6.6: Example-6-3.py
table:
                      b      se       t    pval
Intercept        2.0503  1.3603  1.5072  0.1322
atndrte         -0.0067  0.0102 -0.6561  0.5120
priGPA          -1.6285  0.4810 -3.3857  0.0008
atndrte:priGPA   0.0056  0.0043  1.2938  0.1962
ACT             -0.1280  0.0985 -1.3000  0.1940
I(priGPA ** 2)   0.2959  0.1010  2.9283  0.0035
I(ACT ** 2)      0.0045  0.0022  2.0829  0.0376

partial_effect: 0.007754572228608965

fstat: 8.632581056740811

fpval: 0.003414992399585439


6.2. Prediction 


In this section, we are concerned with predicting the value of the dependent variable $y$ given certain
values of the regressors $x_1, \ldots, x_k$. If these are the regressor values in our estimation sample, we
called these predictions "fitted values" and discussed their calculation in Section 2.2. Now, we
generalize this to arbitrary values and add standard errors, confidence intervals, and prediction
intervals.


6.2.1. Confidence and Prediction Intervals for Predictions 


Confidence intervals reflect the uncertainty about the expected value of the dependent variable given
values of the regressors. If we are interested in predicting the college GPA of an individual, prediction
intervals account for the additional uncertainty regarding the unobserved characteristics reflected by
the error term $u$.

Given a model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u \tag{6.7}$$

we are interested in the expected value of $y$ given the regressors take specific values $c_1, c_2, \ldots, c_k$:

$$\theta_0 = E(y \mid x_1 = c_1, \ldots, x_k = c_k) = \beta_0 + \beta_1 c_1 + \beta_2 c_2 + \cdots + \beta_k c_k \tag{6.8}$$

The natural point estimates are

$$\hat{\theta}_0 = \hat{\beta}_0 + \hat{\beta}_1 c_1 + \hat{\beta}_2 c_2 + \cdots + \hat{\beta}_k c_k \tag{6.9}$$

and can readily be obtained once the parameter estimates $\hat{\beta}_0, \ldots, \hat{\beta}_k$ are calculated.

Standard errors and confidence intervals are less straightforward to compute. Wooldridge (2019,
Section 6.4) suggests a smart way to obtain these from a modified regression. statsmodels provides
an even simpler and more convenient approach.

The method predict automatically calculates $\hat{\theta}_0$. The method can be called on an object created
by the fit method. Its argument is a data frame containing the values $c_1, \ldots, c_k$ of
the regressors $x_1, \ldots, x_k$ with the same variable names as in the data frame used for estimation. If we
don't have one yet, it can for example be specified with pandas as


pd.DataFrame({'x1': [c1], 'x2': [c2],...,'xk': [ck]}, index=['newobservation1'])


where x1 through xk are the variable names and c1 through ck are the values, which can also be
specified as lists to get predictions at several values of the regressors. See Section 1.2.4 for more on
data frames and Script 6.7 (Predictions.py) for an example.


Script 6.7: Predictions.py
import wooldridge as woo
import statsmodels.formula.api as smf
import pandas as pd

gpa2 = woo.dataWoo('gpa2')

reg = smf.ols(formula='colgpa ~ sat + hsperc + hsize + I(hsize**2)', data=gpa2)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')

# generate data set containing the regressor values for predictions:
cvalues1 = pd.DataFrame({'sat': [1200], 'hsperc': [30],
                         'hsize': [5]}, index=['newPerson1'])
print(f'cvalues1: \n{cvalues1}\n')

# point estimate of prediction (cvalues1):
colgpa_pred1 = results.predict(cvalues1)
print(f'colgpa_pred1: \n{colgpa_pred1}\n')

# define three sets of regressor variables:
cvalues2 = pd.DataFrame({'sat': [1200, 900, 1400],
                         'hsperc': [30, 20, 5], 'hsize': [5, 3, 1]},
                        index=['newPerson1', 'newPerson2', 'newPerson3'])
print(f'cvalues2: \n{cvalues2}\n')

# point estimate of prediction (cvalues2):
colgpa_pred2 = results.predict(cvalues2)
print(f'colgpa_pred2: \n{colgpa_pred2}\n')


Output of Script 6.7: Predictions.py
table:
                    b      se        t    pval
Intercept      1.4927  0.0753  19.8118  0.0000
sat            0.0015  0.0001  22.8864  0.0000
hsperc        -0.0139  0.0006 -24.6981  0.0000
hsize         -0.0609  0.0165  -3.6895  0.0002
I(hsize ** 2)  0.0055  0.0023   2.4056  0.0162

cvalues1:
             sat  hsperc  hsize
newPerson1  1200      30      5

colgpa_pred1:
newPerson1    2.700075
dtype: float64

cvalues2:
             sat  hsperc  hsize
newPerson1  1200      30      5
newPerson2   900      20      3
newPerson3  1400       5      1

colgpa_pred2:
newPerson1    2.700075
newPerson2    2.425282
newPerson3    3.457448
dtype: float64


The method get_prediction calculates not only $\hat{\theta}_0$ (i.e. the exact same predictions as the
method predict), but also

* standard errors of the predictions (column mean_se),
* confidence intervals (columns mean_ci_lower and mean_ci_upper), and
* prediction intervals (columns obs_ci_lower and obs_ci_upper). Wooldridge (2019)
  explains how to calculate the prediction interval manually.

All you have to do is call a second method, summary_frame, and provide the significance level.
Script 6.8 (Example-6-5.py) demonstrates the procedure for α = 5% and 1%.


Wooldridge, Example 6.5: Confidence Interval for Predicted College GPA 


We try to predict the college GPA, for example to support the admission decisions for our college. Our
regression model equation is

$$\text{colgpa} = \beta_0 + \beta_1 \text{sat} + \beta_2 \text{hsperc} + \beta_3 \text{hsize} + \beta_4 \text{hsize}^2 + u.$$

Script 6.8 (Example-6-5.py) shows the implementation of the estimation and prediction. The estimation
results are stored as the variable results. The values of the regressors for which we want to do the
prediction are stored in the new data frame cvalues2. Then the methods get_prediction and
summary_frame are called. For an SAT score of 1200, a high school percentile of 30 and a high school
size of 5 (i.e. 500 students), the predicted college GPA is 2.7. Wooldridge (2019) obtains the same value
using a general but more cumbersome regression approach. We define two other types of students
with different values of sat, hsperc, and hsize in the data frame cvalues2.

Script 6.8 (Example-6-5.py) also calculates the 95% and 99% confidence and prediction intervals.
The object colgpa_PICI_95 contains the 95% confidence interval, for example, which is reported in
columns mean_ci_lower and mean_ci_upper. With 95% confidence we can say that the expected
college GPA for students with the features of the student named newPerson1 is between 2.66 and 2.74.
The object colgpa_PICI_99 contains the 99% prediction interval, for example, which is reported in
columns obs_ci_lower and obs_ci_upper. All results are the same as those manually calculated by
Wooldridge (2019).


Script 6.8: Example-6-5.py
import wooldridge as woo
import statsmodels.formula.api as smf
import pandas as pd

gpa2 = woo.dataWoo('gpa2')

reg = smf.ols(formula='colgpa ~ sat + hsperc + hsize + I(hsize**2)', data=gpa2)
results = reg.fit()

# define three sets of regressor variables:
cvalues2 = pd.DataFrame({'sat': [1200, 900, 1400],
                         'hsperc': [30, 20, 5], 'hsize': [5, 3, 1]},
                        index=['newPerson1', 'newPerson2', 'newPerson3'])

# point estimates and 95% confidence and prediction intervals:
colgpa_PICI_95 = results.get_prediction(cvalues2).summary_frame(alpha=0.05)
print(f'colgpa_PICI_95: \n{colgpa_PICI_95}\n')

# point estimates and 99% confidence and prediction intervals:
colgpa_PICI_99 = results.get_prediction(cvalues2).summary_frame(alpha=0.01)
print(f'colgpa_PICI_99: \n{colgpa_PICI_99}\n')


Output of Script 6.8: Example-6-5.py
colgpa_PICI_95:
       mean   mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  obs_ci_upper
0  2.700075  0.019878       2.661104       2.739047      1.601749      3.798402
1  2.425282  0.014258       2.397329       2.453235      1.321282      3.523273
2  3.457448  0.027891       3.402766       3.512130      2.358452      4.556444

colgpa_PICI_99:
       mean   mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  obs_ci_upper
0  2.700075  0.019878       2.648850       2.751301      1.256386      4.143765
1  2.425282  0.014258       2.388540       2.462025      0.982034      3.868530
2  3.457448  0.027891       3.385572       3.529325      2.012879      4.902018
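The link between the two interval types can be traced with the numbers in the 95% table above. The sketch below assumes the usual textbook relation that the squared standard error of the prediction error is the squared standard error of the expected value plus the estimated error variance. It backs out the critical value and the implied standard error of the regression for newPerson1; all variable names are our own.

```python
import math

# values for newPerson1 from the 95% table in the output of Script 6.8:
mean = 2.700075
mean_se = 0.019878
mean_ci_upper = 2.739047
obs_ci_lower = 1.601749
obs_ci_upper = 3.798402

# critical value implied by the confidence interval:
t_crit = (mean_ci_upper - mean) / mean_se
print(f't_crit: {t_crit}\n')

# standard error of the prediction error implied by the prediction interval:
se_pred = (obs_ci_upper - mean) / t_crit
print(f'se_pred: {se_pred}\n')

# se_pred^2 = mean_se^2 + sigma^2 gives the implied standard error of the regression:
sigma = math.sqrt(se_pred**2 - mean_se**2)
print(f'sigma: {sigma}\n')

# rebuild the lower bound of the prediction interval from these pieces:
lower_rebuilt = mean - t_crit * math.sqrt(mean_se**2 + sigma**2)
print(f'lower_rebuilt: {lower_rebuilt}\n')
```

The prediction interval is much wider than the confidence interval because the error variance dominates the sampling uncertainty of the estimated expected value.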


6.2.2. Effect Plots for Nonlinear Specifications 


In models with quadratic or other nonlinear terms, the coefficients themselves are often difficult 
to interpret directly. We have to do additional calculations to obtain the partial effect at different 
values of the regressors or derive the extreme points. In Example 6.2, we found the number of rooms 
implying the minimum predicted house price to be around 4.4. 

For a better visual understanding of the implications of our model, it is often useful to calculate 
predictions for different values of one regressor of interest while keeping the other regressors fixed at 
certain values like their overall sample means. By plotting the results against the regressor value, we 
get a very intuitive graph showing the estimated ceteris paribus effects of the regressor. 

We already know how to calculate predictions and their confidence intervals from Section 6.2.1.
Script 6.9 (Effects-Manual.py) repeats the regression from Example 6.2 and creates an effects plot
for the number of rooms. The number of rooms is varied between 4 and 8 and the other variables
are set to their respective sample means for all predictions. The regressor values and the implied
predictions are shown in a table and then plotted with their confidence bands. We see the minimum
at a number of rooms of around 4. The resulting graph is shown in Figure 6.1.


Script 6.9: Effects-Manual.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

hprice2 = woo.dataWoo('hprice2')

# repeating the regression from Example 6.2:
reg = smf.ols(
    formula='np.log(price) ~ np.log(nox)+np.log(dist)+rooms+I(rooms**2)+stratio',
    data=hprice2)
results = reg.fit()

# predictions with rooms = 4-8, all others at the sample mean:
nox_mean = np.mean(hprice2['nox'])
dist_mean = np.mean(hprice2['dist'])
stratio_mean = np.mean(hprice2['stratio'])
X = pd.DataFrame({'rooms': np.linspace(4, 8, num=5),
                  'nox': nox_mean,
                  'dist': dist_mean,
                  'stratio': stratio_mean})
print(f'X: \n{X}\n')

# calculate 95% confidence interval:
lpr_PICI = results.get_prediction(X).summary_frame(alpha=0.05)
lpr_CI = lpr_PICI[['mean', 'mean_ci_lower', 'mean_ci_upper']]
print(f'lpr_CI: \n{lpr_CI}\n')

# plot:
plt.plot(X['rooms'], lpr_CI['mean'], color='black',
         linestyle='-', label='')
plt.plot(X['rooms'], lpr_CI['mean_ci_upper'], color='lightgrey',
         linestyle='--', label='upper CI')
plt.plot(X['rooms'], lpr_CI['mean_ci_lower'], color='darkgrey',
         linestyle='--', label='lower CI')
plt.ylabel('lprice')
plt.xlabel('rooms')
plt.legend()
plt.savefig('PyGraphs/Effects-Manual.pdf')


Figure 6.1. Nonlinear Effects in Example 6.2


Output of Script 6.9: Effects-Manual.py
X:
   rooms       nox      dist    stratio
0    4.0  5.549783  3.795751  18.459289
1    5.0  5.549783  3.795751  18.459289
2    6.0  5.549783  3.795751  18.459289
3    7.0  5.549783  3.795751  18.459289
4    8.0  5.549783  3.795751  18.459289

lpr_CI:
        mean  mean_ci_lower  mean_ci_upper
0   9.661702       9.499811       9.823593
1   9.676940       9.610215       9.743665
2   9.816700       9.787055       9.846345
3  10.080983      10.042409      10.119557
4  10.469788      10.383361      10.556215
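Since all other regressors are held fixed, the predictions above are an exact quadratic function of rooms. As a small check, the following sketch recovers the two coefficients of that quadratic from three of the reported predictions and computes the implied turning point. The names b1, b2 and turning_point are our own; b1 and b2 should correspond to the coefficients of rooms and I(rooms ** 2) up to rounding.

```python
# predicted log prices at rooms = 4, 5, 6 from the output of Script 6.9:
rooms = [4.0, 5.0, 6.0]
pred = [9.661702, 9.676940, 9.816700]

# for equally spaced points with step 1, the second difference equals 2*b2:
b2 = ((pred[2] - pred[1]) - (pred[1] - pred[0])) / 2

# the first difference between rooms=4 and rooms=5 is b1 + b2*(5**2 - 4**2):
b1 = (pred[1] - pred[0]) - b2 * (rooms[1]**2 - rooms[0]**2)

print(f'b1: {b1}\n')
print(f'b2: {b2}\n')

# turning point of the quadratic (a minimum since b2 > 0):
turning_point = -b1 / (2 * b2)
print(f'turning_point: {turning_point}\n')
```

The turning point lies between 4 and 5 rooms, which is why the lowest prediction on the grid used in Script 6.9 appears at 4 rooms.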


7. Multiple Regression Analysis with 
Qualitative Regressors 


Many variables of interest are qualitative rather than quantitative. Examples include gender, race, 
labor market status, marital status, and brand choice. In this chapter, we discuss the use of qualitative 
variables as regressors. Wooldridge (2019, Section 7.5) also covers linear probability models with a 
binary dependent variable in a linear regression. Since this does not change the implementation, we 
will skip this topic here and cover binary dependent variables in Chapter 17. 

Qualitative information can be represented as binary or dummy variables which can only take the 
value zero or one. In Section 7.1, we see that dummy variables can be used as regressors just as any 
other variable. An even more natural way to store yes/no type of information in Python is to use 
Boolean variables which can also be directly used as regressors, see Section 7.2. 

While qualitative variables with more than two outcomes can be represented by a set of dummy
variables, the more natural and convenient way to do this is to use categorical variables as covered in
Section 1.2.4. A special case in which we wish to break a numeric variable into categories is discussed
in Section 7.4. Finally, Section 7.5 revisits interaction effects and shows how these can be used with
categorical variables to conveniently allow and test for differences in the regression equation.


7.1. Linear Regression with Dummy Variables as Regressors 


If qualitative data are stored as dummy variables (i.e. variables taking the values zero or one), these
can easily be used as regressors in linear regression. If a single dummy variable is used in a model, its
coefficient represents the difference in the intercept between groups, see Wooldridge (2019, Section
7.2).

A qualitative variable can also take g > 2 values. A variable MobileOS could for example take one
of the g = 4 values "Android", "iOS", "Windows", or "other". This information can be represented
by g − 1 dummy variables, each taking the values zero or one, where one category is left out to
serve as a reference category. They take the value one if the respective operating system is used and
zero otherwise. Wooldridge (2019, Section 7.3) gives more information on these variables and their
interpretation.
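As a small illustration of the g − 1 dummy variables, the following sketch uses the pandas function get_dummies on a handful of made-up observations of MobileOS. The data and the names cats and dummies are hypothetical and only serve this example.

```python
import pandas as pd

# hypothetical observations of the qualitative variable MobileOS:
MobileOS = pd.Series(['Android', 'iOS', 'other', 'Windows', 'iOS', 'Android'])

# store as categorical with an explicit order so that "Android" comes first:
cats = ['Android', 'iOS', 'Windows', 'other']
MobileOS_cat = pd.Series(pd.Categorical(MobileOS, categories=cats))

# g - 1 = 3 dummies; drop_first leaves out the first category as reference:
dummies = pd.get_dummies(MobileOS_cat, drop_first=True, dtype=int)
print(f'dummies: \n{dummies}\n')
```

Observations of the reference category "Android" are identified by zeros in all three dummy variables.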

Here, we are concerned with implementing linear regressions with dummy variables as regressors. 
Everything works as before once we have generated the dummy variables. In the example data sets 
provided with Wooldridge (2019), this has usually already been done for us, so we don't have to 
learn anything new in terms of implementation. We show two examples.


Wooldridge, Example 7.1: Hourly Wage Equation


We are interested in the wage differences by gender and regress the hourly wage on a dummy variable
which is equal to one for females and zero for males. We also include regressors for education,
experience, and tenure. The implementation with statsmodels is standard and the dummy variable
female is used just as any other regressor as shown in Script 7.1 (Example-7-1.py). Its estimated
coefficient of −1.81 indicates that on average, a woman makes $1.81 per hour less than a man with the same
education, experience, and tenure.


Script 7.1: Example-7-1.py 


import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

wage1 = woo.dataWoo('wage1')

reg = smf.ols(formula='wage ~ female + educ + exper + tenure', data=wage1)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Output of Script 7.1: Example-7-1.py 


table: 

                b      se        t    pval
Intercept -1.5679  0.7246  -2.1640  0.0309
female    -1.8109  0.2648  -6.8379  0.0000
educ       0.5715  0.0493  11.5836  0.0000
exper      0.0254  0.0116   2.1951  0.0286
tenure     0.1410  0.0212   6.6632  0.0000


Wooldridge, Example 7.6: Log Hourly Wage Equation


We use log wage as the dependent variable and distinguish gender and marital status using a
qualitative variable with the four outcomes "single female", "single male", "married female", and "married
male". We actually implement this regression using an interaction term between married and female
in Script 7.2 (Example-7-6.py). Relative to the reference group of single males with the same
education, experience, and tenure, married males make about 21.3% more (the coefficient of married),
and single females make about 11.0% less (the coefficient of female). The coefficient of the
interaction term implies that married females make around 30.1%-21.3%=8.8% less than single females,
30.1%+11.0%=41.1% less than married males, and 30.1%+11.0%-21.3%=19.8% less than single males. Note
once again that the approximate interpretation as percent may be inaccurate, see Section 6.1.3.
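The group comparisons can be retraced with a few lines of arithmetic. The sketch below uses the rounded coefficients from the output of Script 7.2; the dictionary groups and the helper exact_pct are our own names. The helper converts a log difference into the exact percentage difference, which illustrates how far the approximation can be off for larger values.

```python
import math

# rounded coefficients from the output of Script 7.2:
b_married = 0.2127
b_female = -0.1104
b_interact = -0.3006   # coefficient of married:female

# predicted log wage differences relative to single males (reference group):
groups = {'married male': b_married,
          'single female': b_female,
          'married female': b_married + b_female + b_interact}

# comparisons of married females with the other three groups:
mf_vs_sf = groups['married female'] - groups['single female']
mf_vs_mm = groups['married female'] - groups['married male']
mf_vs_sm = groups['married female']
print(f'mf_vs_sf: {mf_vs_sf}\n')
print(f'mf_vs_mm: {mf_vs_mm}\n')
print(f'mf_vs_sm: {mf_vs_sm}\n')


def exact_pct(logdiff):
    # exact percentage difference implied by a difference in logs
    return 100 * (math.exp(logdiff) - 1)


print(f'exact_pct(mf_vs_mm): {exact_pct(mf_vs_mm)}\n')
```

For the comparison with married males, the exact difference is noticeably smaller in absolute value than the approximate 41.1%.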


Script 7.2: Example-7-6.py 
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

wage1 = woo.dataWoo('wage1')

reg = smf.ols(formula='np.log(wage) ~ married*female + educ + exper +'
                      'I(exper**2) + tenure + I(tenure**2)', data=wage1)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Output of Script 7.2: Example-7-6.py 


table: 

                     b      se        t    pval
Intercept       0.3214  0.1000   3.2135  0.0014
married         0.2127  0.0554   3.8419  0.0001
female         -0.1104  0.0557  -1.9797  0.0483
married:female -0.3006  0.0718  -4.1885  0.0000
educ            0.0789  0.0067  11.7873  0.0000
exper           0.0268  0.0052   5.1118  0.0000
I(exper ** 2)  -0.0005  0.0001  -4.8471  0.0000
tenure          0.0291  0.0068   4.3016  0.0000
I(tenure ** 2) -0.0005  0.0002  -2.3056  0.0215


7.2. Boolean Variables


A natural way for storing qualitative yes/no information in Python is to use Boolean variables
introduced in Section 1.2.2. They can take the values True or False and can be transformed into a 0/1
dummy variable with the function int where True=1 and False=0. Vice versa, 0/1-coded dummies can
be transformed into logical variables with the function bool.
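A minimal sketch of both conversions, using a short made-up list of values; the variable names are hypothetical:

```python
# hypothetical yes/no information stored as Boolean values:
isfemale = [True, False, False, True]

# Boolean -> 0/1 dummy with int():
female = [int(v) for v in isfemale]
print(f'female: {female}\n')

# 0/1 dummy -> Boolean with bool():
isfemale_again = [bool(v) for v in female]
print(f'isfemale_again: {isfemale_again}\n')
```

For whole columns of a data frame, the same conversions are typically done in one step, as in the comparison used in Script 7.3.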

Instead of transforming Boolean variables into dummies, they can be directly used as regressors.
The coefficient is then named varname[T.True] indicating that True was treated as 1. Script 7.3
(Example-7-1-Boolean.py) repeats the analysis of Example 7.1 with the regressor female being
coded as bool instead of a 0/1 dummy variable.¹


Script 7.3: Example-7-1-Boolean.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

wage1 = woo.dataWoo('wage1')

# regression with boolean variable:
wage1['isfemale'] = (wage1['female'] == 1)
reg = smf.ols(formula='wage ~ isfemale + educ + exper + tenure', data=wage1)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Output of Script 7.3: Example-7-1-Boolean.py 


table:
                       b      se        t    pval
Intercept        -1.5679  0.7246  -2.1640  0.0309
isfemale[T.True] -1.8109  0.2648  -6.8379  0.0000
educ              0.5715  0.0493  11.5836  0.0000
exper             0.0254  0.0116   2.1951  0.0286
tenure            0.1410  0.0212   6.6632  0.0000


In real-world data sets, qualitative information is often not readily coded as logical or dummy
variables, so we might want to create our own regressors. Suppose a qualitative variable saved as
the numpy array OS takes one of the four string values "Android", "iOS", "Windows", or "other".
We can manually define the three relevant logical variables with "Android" as the reference category
with


ios = OS == 'iOS'
wind = OS == 'Windows'
oth = OS == 'other'


A more convenient and elegant way to deal with qualitative variables is to use categorical variables,
discussed in the next section.


¹ To be more precise, a numpy version of the type bool is used internally to allow for vectorized operations.


7.3. Categorical Variables


We have introduced categorical variables of type Categorical in Section 1.2.4. They take one of a
given set of outcomes which can be labeled arbitrarily. This makes them the natural variable type to
store qualitative information.

In a linear regression performed by statsmodels, we can easily transform any variable into a
categorical variable using the function C in the definition of the formula. The function ols is clever
enough to implicitly add g − 1 dummy variables if the variable has g outcomes. As a reference
category, the first category is left out by default.

Script 7.4 (Regr-Categorical.py) shows how categorical variables are used. It uses the data
set CPS1985.² This data set is similar to the one used in Examples 7.1 and 7.6 in that it contains
wage and other data for 534 individuals. The frequency tables for the two variables gender and
occupation are shown in the output. The variable gender has two categories male and female.
The variable occupation has six categories.

In the output, the coefficients are labeled with a combination of the variable and category name.
As an example, the estimated coefficient of 0.224 for C(gender)[T.male] in results implies that
men make about 22.4% more than women who are the same in terms of the other regressors.
Employees in technical positions earn around 1% less (see coefficient of C(oc)[T.technical])
than otherwise equal employees in management positions (who are the reference category).

We can choose different reference categories using a second argument of the C command, where
we provide a new reference group somegroup with the command Treatment('somegroup'). In
the specification results_newref, we choose male and technical. When we rerun the same
regression command, we see the expected results: Variables like education and experience get
the same coefficients. The dummy variable for females gets the negative of what the males got
previously. Obviously, it is equivalent to say "female log wages are lower by 0.224" and "male log
wages are higher by 0.224".

The coefficients for the occupation are now relative to technical. From the first regression we
already knew that technical positions make 1% less than managers, so it is not surprising that in
the second regression we find that managers make 1% more than technical positions. The other
occupation coefficients are higher by 0.010085 implying the same relative comparisons as in the first
specification.
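The effect of switching the reference categories can be reproduced from the first set of estimates alone. The sketch below starts from the rounded coefficients of the first regression reported in the output of Script 7.4 and derives the coefficients of the second specification; the dictionaries b and b_newref are our own names. Small discrepancies in the last digit are due to rounding.

```python
# coefficients from the first regression (reference groups: female, management):
b = {'Intercept': 0.9050, 'male': 0.2238,
     'office': -0.2073, 'sales': -0.3601, 'services': -0.3626,
     'technical': -0.0101, 'worker': -0.1525}

# occupation effects relative to "technical" instead of "management"
# (management itself has an implicit coefficient of zero in the first regression):
occupations = ['management', 'office', 'sales', 'services', 'worker']
b_newref = {oc: b.get(oc, 0.0) - b['technical'] for oc in occupations}

# gender effect relative to "male" instead of "female":
b_newref['female'] = -b['male']

# the new intercept refers to male employees in technical positions:
b_newref['Intercept'] = b['Intercept'] + b['male'] + b['technical']

for name, value in b_newref.items():
    print(f'{name}: {round(value, 4)}')
```

The fitted values and all comparisons between groups are unaffected by the choice of the reference category; only the labeling of the coefficients changes.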


² The data set is included in the R package AER, see https://cran.r-project.org/web/packages/AER/index.html.


Script 7.4: Regr-Categorical.py
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

CPS1985 = pd.read_csv('data/CPS1985.csv')
# rename variable to make outputs more compact:
CPS1985['oc'] = CPS1985['occupation']

# table of categories and frequencies for two categorical variables:
freq_gender = pd.crosstab(CPS1985['gender'], columns='count')
print(f'freq_gender: \n{freq_gender}\n')

freq_occupation = pd.crosstab(CPS1985['oc'], columns='count')
print(f'freq_occupation: \n{freq_occupation}\n')

# directly using categorical variables in regression formula:
reg = smf.ols(formula='np.log(wage) ~ education +'
                      'experience + C(gender) + C(oc)', data=CPS1985)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')

# rerun regression with different reference category:
reg_newref = smf.ols(formula='np.log(wage) ~ education + experience + '
                             'C(gender, Treatment("male")) + '
                             'C(oc, Treatment("technical"))', data=CPS1985)
results_newref = reg_newref.fit()

# print results:
table_newref = pd.DataFrame({'b': round(results_newref.params, 4),
                             'se': round(results_newref.bse, 4),
                             't': round(results_newref.tvalues, 4),
                             'pval': round(results_newref.pvalues, 4)})
print(f'table_newref: \n{table_newref}\n')


Output of Script 7.4: Regr-Categorical.py
freq_gender:
col_0   count
gender
female    245
male      289

freq_occupation:
col_0       count
oc
management     55
office         97
sales          38
services       83
technical     105
worker        156

table:
                         b      se       t    pval
Intercept           0.9050  0.1717  5.2718  0.0000
C(gender)[T.male]   0.2238  0.0423  5.2979  0.0000
C(oc)[T.office]    -0.2073  0.0776 -2.6699  0.0078
C(oc)[T.sales]     -0.3601  0.0936 -3.8455  0.0001
C(oc)[T.services]  -0.3626  0.0818 -4.4305  0.0000
C(oc)[T.technical] -0.0101  0.0740 -0.1363  0.8916
C(oc)[T.worker]    -0.1525  0.0763 -1.9981  0.0462
education           0.0759  0.0101  7.5449  0.0000
experience          0.0119  0.0017  7.0895  0.0000

table_newref:
                                                  b      se       t    pval
Intercept                                    1.1187  0.1765  6.3393  0.0000
C(gender, Treatment("male"))[T.female]      -0.2238  0.0423 -5.2979  0.0000
C(oc, Treatment("technical"))[T.management]  0.0101  0.0740  0.1363  0.8916
C(oc, Treatment("technical"))[T.office]     -0.1972  0.0678 -2.9082  0.0038
C(oc, Treatment("technical"))[T.sales]      -0.3500  0.0863 -4.0541  0.0001
C(oc, Treatment("technical"))[T.services]   -0.3525  0.0750 -4.7030  0.0000
C(oc, Treatment("technical"))[T.worker]     -0.1425  0.0705 -2.0218  0.0437
education                                    0.0759  0.0101  7.5449  0.0000
experience                                   0.0119  0.0017  7.0895  0.0000


7.3.1. ANOVA Tables 


A natural question to ask is whether a regressor has additional statistically significant explanatory
power in a regression model, given all the other regressors. In simple model specifications, this
question can be answered by a simple t test, so the results for all regressors are available with a
quick look at the standard regression table.³ When working with categorical variables, polynomials
or other specifications, the influence of one variable is captured by several regressors. In the example
of Script 7.4 (Regr-Categorical.py), the effect of occupation is captured by the five regressors
of the respective dummy variables.

We can test its significance with an F test of the joint null hypothesis that all of these parameters
are equal to zero. As an example, let's revisit the underlying model in reg from Script 7.4
(Regr-Categorical.py):

$$\log(\text{wage}) = \beta_0 + \beta_1 \text{education} + \beta_2 \text{experience} + \beta_3 \text{gender} + \beta_4 \text{office} + \beta_5 \text{sales} + \beta_6 \text{services} + \beta_7 \text{technical} + \beta_8 \text{worker} + u$$

The significance of occupation can be assessed with an F test of $H_0: \beta_4 = \beta_5 = \beta_6 = \beta_7 = \beta_8 = 0$.
As discussed in Section 4.3, such a test can be performed with the command f_test from the
module statsmodels.

A Type II ANOVA (analysis of variance) table does exactly this for each variable in the model
and displays the results in a clearly arranged table. statsmodels implements this in the method
anova_lm.⁴ The example in Script 7.5 (Regr-Categorical-Anova.py) shows that all the relevant
results from our previous F test can be found again in the row labelled occupation. Column df
indicates that this test involves five parameters. All other variables enter the model with a single
parameter. Consequently, the value of their F test statistics corresponds to the respective squared t
statistics in the object results.

The ANOVA table also allows us to quickly compare the relevance of the regressors. The first column
shows the sum of squared deviations explained by the variables after all the other regressors are
controlled for. ANOVA tables of Types I and III are less often of interest. They differ in what other
variables are controlled for when testing for the effect of one regressor.

Script 7.5 (Regr-Categorical-Anova.py) shows the ANOVA Type II table. We see that 
education has the highest explanatory power. Moreover, occupation has a highly significant 
effect on wages. The explained sum of squares (after controlling for all other regressors) is higher 
than that of gender. But since it is based on five parameters instead of one, the F statistic is lower. 
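These statements can be verified with the figures reported in the output of Script 7.5. The sketch below rebuilds each F statistic as the explained sum of squares per parameter divided by the residual variance estimate, and compares the single-parameter cases with the squared t statistics from the regression table. All names are our own, and the inputs are the rounded values from the output.

```python
# sums of squares and degrees of freedom from the ANOVA table of Script 7.5:
sum_sq = {'gender': 5.414018, 'occupation': 7.152529,
          'education': 10.980589, 'experience': 9.695055}
df = {'gender': 1, 'occupation': 5, 'education': 1, 'experience': 1}
ssr, df_resid = 101.269451, 525

# F statistic: explained sum of squares per parameter relative to
# the residual variance estimate
sigma2 = ssr / df_resid
F = {v: (sum_sq[v] / df[v]) / sigma2 for v in sum_sq}
for v in F:
    print(f'F({v}): {round(F[v], 4)}')

# t statistics of the single-parameter regressors from the regression table:
t = {'gender': 5.2979, 'education': 7.5449, 'experience': 7.0895}
t_squared = {v: t[v]**2 for v in t}
for v in t_squared:
    print(f't^2({v}): {round(t_squared[v], 4)}')
```

The comparison also shows why occupation ends up with a lower F statistic than gender despite its larger sum of squares: the explained variation is spread over five parameters.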


³ Section 4.1 discusses t tests.
⁴ In statsmodels, this functionality is not located in statsmodels.formula.api, where we find formula based
estimation routines. Instead it is in statsmodels.api, so we import another part of the module as the alias sm.


Script 7.5: Regr-Categorical-Anova.py
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

CPS1985 = pd.read_csv('data/CPS1985.csv')

# run regression:
reg = smf.ols(
    formula='np.log(wage) ~ education + experience + gender + occupation',
    data=CPS1985)
results = reg.fit()

# print regression table:
table_reg = pd.DataFrame({'b': round(results.params, 4),
                          'se': round(results.bse, 4),
                          't': round(results.tvalues, 4),
                          'pval': round(results.pvalues, 4)})
print(f'table_reg: \n{table_reg}\n')

# ANOVA table:
table_anova = sm.stats.anova_lm(results, typ=2)
print(f'table_anova: \n{table_anova}\n')


Output of Script 7.5: Regr-Categorical-Anova.py 


table_reg:
                              b      se       t    pval
Intercept                0.9050  0.1717  5.2718  0.0000
gender[T.male]           0.2238  0.0423  5.2979  0.0000
occupation[T.office]    -0.2073  0.0776 -2.6699  0.0078
occupation[T.sales]     -0.3601  0.0936 -3.8455  0.0001
occupation[T.services]  -0.3626  0.0818 -4.4305  0.0000
occupation[T.technical] -0.0101  0.0740 -0.1363  0.8916
occupation[T.worker]    -0.1525  0.0763 -1.9981  0.0462
education                0.0759  0.0101  7.5449  0.0000
experience               0.0119  0.0017  7.0895  0.0000

table_anova:
                sum_sq     df          F        PR(>F)
gender        5.414018    1.0  28.067296  1.727015e-07
occupation    7.152529    5.0   7.416013  9.805485e-07
education    10.980589    1.0  56.925450  2.010374e-13
experience    9.695055    1.0  50.261001  4.365391e-12
Residual    101.269451  525.0        NaN           NaN


7.4. Breaking a Numeric Variable Into Categories


Sometimes, we do not use a numeric variable directly in a regression model because the implied 
linear relation seems implausible or inconvenient to interpret. As an alternative to working with 
transformations such as logs and quadratic terms, it sometimes makes sense to estimate different 
levels for different ranges of the variable. Wooldridge (2019, Example 7.8) gives the example of the 
ranking of a law school and how it relates to the starting salary of its graduates. 

Given a numeric variable, we need to generate a categorical variable to represent the range into 
which the rank of a school falls. In Python, the command cut from pandas is very convenient for 
this. It takes a numeric variable and a list of cut points and returns a categorical variable. By default, 
the upper cut points are included in the corresponding range. 
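The treatment of the cut points is easy to check on a few made-up values. In the following sketch, the ranks are hypothetical and chosen to sit exactly on and right next to some of the cut points; the argument right of the pandas function cut controls which end of each range is included.

```python
import pandas as pd

# hypothetical ranks, chosen to sit on and next to the cut points:
rank = pd.Series([1, 10, 11, 100, 101, 175])
cutpts = [0, 10, 25, 40, 60, 100, 175]

# default right=True: each range includes its upper cut point
rc = pd.cut(rank, bins=cutpts)
print(f'rc: \n{rc}\n')

# right=False includes the lower cut point instead, so the
# value 175 no longer falls into any range and becomes missing:
rc_left = pd.cut(rank, bins=cutpts, right=False)
print(f'rc_left: \n{rc_left}\n')
```

With the default setting, a rank of 10 belongs to the first range and a rank of 11 to the second one, which is the grouping needed in the example below.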


Wooldridge, Example 7.8: Effects of Law School Rankings on Starting Salaries 


The variable rank of the data set LAWSCH85 is the rank of the law school as a number between 1 and
175. We would like to compare schools in the top 10, ranks 11-25, 26-40, 41-60, and 61-100 to the
reference group of ranks above 100. So in Script 7.6 (Example-7-8.py), we store the cut points 0, 10,
25, 40, 60, 100, and 175 in a variable cutpts. In the data frame lawsch85, we create our new variable
rc using the cut command.

To be consistent with Wooldridge (2019), we do not want the top 10 schools as a reference category
but the last category. It is chosen with the second argument of the C command. The regression results
imply that graduates from the top 10 schools collect a starting salary which is around 70% higher than
those of the schools below rank 100. In fact, this approximation is inaccurate with these large numbers
and the coefficient of 0.6996 actually implies a difference of exp(0.6996)-1=1.013 or 101.3%.

The ANOVA table at the end of the output shows that at a 5% significance level, the school rank is the
only variable that has a significant explanatory power for the salary in this specification.


Script 7.6: Example-7-8.py 
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

lawsch85 = woo.dataWoo('lawsch85')

# define cut points for the rank:
cutpts = [0, 10, 25, 40, 60, 100, 175]

# create categorical variable containing ranges for the rank:
lawsch85['rc'] = pd.cut(lawsch85['rank'], bins=cutpts,
                        labels=['(0,10]', '(10,25]', '(25,40]',
                                '(40,60]', '(60,100]', '(100,175]'])

# display frequencies:
freq = pd.crosstab(lawsch85['rc'], columns='count')
print(f'freq: \n{freq}\n')

# run regression:
reg = smf.ols(formula='np.log(salary) ~ C(rc, Treatment("(100,175]")) +'
                      'LSAT + GPA + np.log(libvol) + np.log(cost)',
              data=lawsch85)
results = reg.fit()

# print regression table:
table_reg = pd.DataFrame({'b': round(results.params, 4),
                          'se': round(results.bse, 4),
                          't': round(results.tvalues, 4),
                          'pval': round(results.pvalues, 4)})
print(f'table_reg: \n{table_reg}\n')

# ANOVA table:
table_anova = sm.stats.anova_lm(results, typ=2)
print(f'table_anova: \n{table_anova}\n')


Output of Script 7.6: Example-7-8.py 


freq: 
col_0      count
rc
(0,10]        10
(10,25]       16
(25,40]       13
(40,60]       18
(60,100]      37
(100,175]     62

table_reg: 
                                                 b      se        t    pval
Intercept                                   9.1653  0.4114  22.2770  0.0000
C(rc, Treatment("(100,175]"))[T.(0,10]]     0.6996  0.0535  13.0780  0.0000
C(rc, Treatment("(100,175]"))[T.(10,25]]    0.5935  0.0394  15.0493  0.0000
C(rc, Treatment("(100,175]"))[T.(25,40]]    0.3751  0.0341  11.0054  0.0000
C(rc, Treatment("(100,175]"))[T.(40,60]]    0.2628  0.0280   9.3991  0.0000
C(rc, Treatment("(100,175]"))[T.(60,100]]   0.1316  0.0210   6.2540  0.0000
LSAT                                        0.0057  0.0031   1.8579  0.0655
GPA                                         0.0137  0.0742   0.1850  0.8535
np.log(libvol)                              0.0364  0.0260   1.3976  0.1647
np.log(cost)                                0.0008  0.0251   0.0335  0.9734

table_anova: 
                                 sum_sq     df          F        PR(>F)
C(rc, Treatment("(100,175]"))  1.868867    5.0  50.962988  1.174406e-28
LSAT                           0.025317    1.0   3.451900  6.551320e-02
GPA                            0.000251    1.0   0.034225  8.535262e-01
np.log(libvol)                 0.014327    1.0   1.953419  1.646748e-01
np.log(cost)                   0.000008    1.0   0.001120  9.733564e-01
Residual                       0.924111  126.0        NaN           NaN

7.5. Interactions and Differences in Regression Functions Across 
Groups 


Dummy and categorical variables can be interacted just like any other variable. Wooldridge (2019, 
Section 7.4) discusses the specification and interpretation in this setup. An important case is a model 
in which one or more dummy variables are interacted with all other regressors. This allows the 
whole regression model to differ by groups of observations identified by the dummy variable(s). 

The example from Wooldridge (2019, Section 7.4-c) is replicated in Script 7.7 (Dummy-Interact.py). 
Note that the example only applies to the subset of data with spring==1. We use the subset 
option of ols directly to define the estimation sample. Other than that, the script does not introduce 
any new syntax but combines two tricks we have seen previously: 

* The dummy variable female is interacted with all other regressors using the "*" formula 
  syntax with the other variables contained in parentheses, see Section 6.1.6. 

* The F test for all interaction effects is performed using the command f_test. 


Script 7.7: Dummy-Interact.py 
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

gpa3 = woo.dataWoo('gpa3')

# model with full interactions with female dummy (only for spring data):
reg = smf.ols(formula='cumgpa ~ female * (sat + hsperc + tothrs)',
              data=gpa3, subset=(gpa3['spring'] == 1))
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')

# F-Test for H0 (the interaction coefficients of 'female' are zero):
hypotheses = ['female = 0', 'female:sat = 0',
              'female:hsperc = 0', 'female:tothrs = 0']
ftest = results.f_test(hypotheses)
fstat = ftest.statistic[0][0]
fpval = ftest.pvalue

print(f'fstat: {fstat}\n')
print(f'fpval: {fpval}\n')

Output of Script 7.7: Dummy-Interact.py 

table: 
                    b      se       t    pval
Intercept      1.4808  0.2073  7.1422  0.0000
female        -0.3535  0.4105 -0.8610  0.3898
sat            0.0011  0.0002  5.8073  0.0000
hsperc        -0.0085  0.0014 -6.1674  0.0000
tothrs         0.0023  0.0009  2.7182  0.0069
female:sat     0.0008  0.0004  1.9488  0.0521
female:hsperc -0.0005  0.0032 -0.1739  0.8621
female:tothrs -0.0001  0.0016 -0.0712  0.9433

fstat: 8.1791116370471

fpval: 2.544637191818678e-06


We can estimate the same model parameters by running two separate regressions, one for females 
and one for males, see Script 7.8 (Dummy-Interact-Sep.py). We see that in the joint model, the 
parameters without interactions (Intercept, sat, hsperc, and tothrs) apply to the males and 
the interaction parameters reflect the differences to the males. 

To reconstruct the parameters for females from the joint model, we need to add the two respective 
parameters. The intercept for females is 1.4808 - 0.3535 = 1.1273 and the coefficient of sat for 
females is 0.0011 + 0.0008 ≈ 0.0018. 
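The equivalence between the fully interacted model and two separate regressions is an algebraic property of OLS and does not depend on the data set. The following sketch illustrates it with simulated data and plain NumPy instead of gpa3 and statsmodels; all names and parameter values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
female = rng.integers(0, 2, size=n)   # simulated group dummy
x = rng.normal(size=n)                # simulated regressor
y = 1.0 + 0.5 * x + female * (-0.3 + 0.2 * x) + rng.normal(size=n)

def ols(X, y):
    # OLS coefficients via least squares
    return np.linalg.lstsq(X, y, rcond=None)[0]

# joint model with columns: intercept, x, female, female:x
X_joint = np.column_stack([np.ones(n), x, female, female * x])
b_joint = ols(X_joint, y)

# separate regressions by group
m, f = female == 0, female == 1
b_m = ols(np.column_stack([np.ones(m.sum()), x[m]]), y[m])
b_f = ols(np.column_stack([np.ones(f.sum()), x[f]]), y[f])

# males: baseline parameters; females: baseline plus interaction parameters
print(b_joint[:2], b_m)
print(b_joint[:2] + b_joint[2:], b_f)
```

Note that only the coefficients coincide. The standard errors differ between the joint and the separate regressions because the joint model imposes a common error variance for both groups.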

Script 7.8: Dummy-Interact-Sep.py 
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

gpa3 = woo.dataWoo('gpa3')

# estimate model for males (& spring data):
reg_m = smf.ols(formula='cumgpa ~ sat + hsperc + tothrs',
                data=gpa3,
                subset=(gpa3['spring'] == 1) & (gpa3['female'] == 0))
results_m = reg_m.fit()

# print regression table:
table_m = pd.DataFrame({'b': round(results_m.params, 4),
                        'se': round(results_m.bse, 4),
                        't': round(results_m.tvalues, 4),
                        'pval': round(results_m.pvalues, 4)})
print(f'table_m: \n{table_m}\n')

# estimate model for females (& spring data):
reg_f = smf.ols(formula='cumgpa ~ sat + hsperc + tothrs',
                data=gpa3,
                subset=(gpa3['spring'] == 1) & (gpa3['female'] == 1))
results_f = reg_f.fit()

# print regression table:
table_f = pd.DataFrame({'b': round(results_f.params, 4),
                        'se': round(results_f.bse, 4),
                        't': round(results_f.tvalues, 4),
                        'pval': round(results_f.pvalues, 4)})
print(f'table_f: \n{table_f}\n')

Output of Script 7.8: Dummy-Interact-Sep.py 

table_m: 
                b      se       t    pval
Intercept  1.4808  0.2060  7.1894  0.0000
sat        0.0011  0.0002  5.8458  0.0000
hsperc    -0.0085  0.0014 -6.2082  0.0000
tothrs     0.0023  0.0009  2.7362  0.0066

table_f: 
                b      se       t    pval
Intercept  1.1273  0.3616  3.1176  0.0025
sat        0.0018  0.0003  5.1950  0.0000
hsperc    -0.0090  0.0029 -3.0956  0.0027
tothrs     0.0022  0.0014  1.5817  0.1174


8. Heteroscedasticity 


The homoscedasticity assumptions SLR.5 for the simple regression model and MLR.5 for the multiple 
regression model require that the variance of the error terms is unrelated to the regressors, i.e. 

$$\mathrm{Var}(u \mid x_1, \ldots, x_k) = \sigma^2. \tag{8.1}$$

Unbiasedness and consistency (Theorems 3.1, 5.1) do not depend on this assumption, but the 
sampling distribution (Theorems 3.2, 4.1, 5.2) does. If homoscedasticity is violated, the standard errors 
are invalid and all inferences from t, F and other tests based on them are unreliable. Also the 
(asymptotic) efficiency of OLS (Theorems 3.4, 5.3) depends on homoscedasticity. Generally, 
homoscedasticity is difficult to justify from theory. Different kinds of individuals might have different 
amounts of unobserved influences in ways that depend on regressors. 

We cover three topics: Section 8.1 shows how the formula of the estimated variance-covariance 
matrix can be adjusted so it does not require homoscedasticity. In this way, we can use OLS to get unbiased 
and consistent parameter estimates and draw inference from valid standard errors and tests. Section 
8.2 presents tests for the existence of heteroscedasticity. Section 8.3 discusses weighted least squares 
(WLS) as an alternative to OLS. This estimator can be more efficient in the presence of 
heteroscedasticity. 


8.1. Heteroscedasticity-Robust Inference 


Wooldridge (2019, Section 8.2) presents formulas for heteroscedasticity-robust standard errors. In 
statsmodels, an easy way to do these calculations is to make use of the argument cov_type 
in the method fit. The argument cov_type can produce several refined versions of the White 
formula presented by Wooldridge (2019). 
If the regression model obtained by ols is stored in the variable reg, the variance-covariance 
matrix can be calculated using 
* reg.fit(cov_type='nonrobust') or reg.fit() for the default homoscedasticity-based 
  standard errors. 
* reg.fit(cov_type='HC0') for the classical version of White's robust variance-covariance 
  matrix presented by Wooldridge (2019, Equation 8.4 in Section 8.2). 
* reg.fit(cov_type='HC1') for a version of White's robust variance-covariance matrix 
  corrected by degrees of freedom. This is the default behavior of Stata's robust option. 
* reg.fit(cov_type='HC2') for a version with a small sample correction. 


* reg.fit(cov_type='HC3') for the refined version of White's robust variance-covariance 
  matrix. 

Regression tables with coefficients, standard errors, t statistics and their p values are based on the 
specified method of variance-covariance estimation. To perform F tests of a joint hypothesis for an 
estimated model, the syntax is the same as in Section 4.3. 
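To see what the HC0 and HC1 options compute, the classical White formula can be written out by hand. The following is a sketch with simulated data and plain NumPy, not the statsmodels implementation; the data generating process and all names are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1, 5, size=n)
u = rng.normal(size=n) * x            # error variance grows with x
y = 2.0 + 1.5 * x + u

X = np.column_stack([np.ones(n), x])
k = X.shape[1]                        # number of parameters incl. intercept
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
resid = y - X @ b

# usual (homoscedasticity-based) variance-covariance matrix
V_default = resid @ resid / (n - k) * XtX_inv

# White's formula (HC0): (X'X)^-1 X' diag(resid^2) X (X'X)^-1
meat = X.T @ (X * resid[:, None] ** 2)
V_hc0 = XtX_inv @ meat @ XtX_inv

# HC1: HC0 with a degrees-of-freedom correction
V_hc1 = V_hc0 * n / (n - k)

se_default, se_hc0, se_hc1 = (np.sqrt(np.diag(V))
                              for V in (V_default, V_hc0, V_hc1))
print(se_default, se_hc0, se_hc1)
```

The parameter estimates b are the same in all three cases. Only the estimated variance-covariance matrix and therefore the standard errors, t statistics and p values change.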


Wooldridge, Example 8.2: Heteroscedasticity-Robust Inference 


Scripts 8.1 (Example-8-2.py) and 8.2 (Example-8-2-cont.py) demonstrate these commands. 
results_default and results_white use the usual standard errors and the classical White standard 
errors, respectively. This reproduces the standard errors reported in Wooldridge (2019). 

For the F tests shown in Script 8.2 (Example-8-2-cont.py), three versions are calculated and displayed. 
The results generally do not differ a lot between the different versions. This is an indication that 
heteroscedasticity might not be a big issue in this example. To be sure, we would like to have a formal test 
as discussed in the next section. 


Script 8.1: Example-8-2.py 
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

gpa3 = woo.dataWoo('gpa3')

# define regression model:
reg = smf.ols(formula='cumgpa ~ sat + hsperc + tothrs + female + black + white',
              data=gpa3, subset=(gpa3['spring'] == 1))

# estimate default model (only for spring data):
results_default = reg.fit()

table_default = pd.DataFrame({'b': round(results_default.params, 5),
                              'se': round(results_default.bse, 5),
                              't': round(results_default.tvalues, 5),
                              'pval': round(results_default.pvalues, 5)})
print(f'table_default: \n{table_default}\n')

# estimate model with White SE (only for spring data):
results_white = reg.fit(cov_type='HC0')

table_white = pd.DataFrame({'b': round(results_white.params, 5),
                            'se': round(results_white.bse, 5),
                            't': round(results_white.tvalues, 5),
                            'pval': round(results_white.pvalues, 5)})
print(f'table_white: \n{table_white}\n')

# estimate model with refined White SE (only for spring data):
results_refined = reg.fit(cov_type='HC3')

table_refined = pd.DataFrame({'b': round(results_refined.params, 5),
                              'se': round(results_refined.bse, 5),
                              't': round(results_refined.tvalues, 5),
                              'pval': round(results_refined.pvalues, 5)})
print(f'table_refined: \n{table_refined}\n')

Output of Script 8.1: Example-8-2.py 

table_default: 
                 b       se        t     pval
Intercept  1.47006  0.22980  6.39706  0.00000
sat        0.00114  0.00018  6.38850  0.00000
hsperc    -0.00857  0.00124 -6.90600  0.00000
tothrs     0.00250  0.00073  3.42551  0.00068
female     0.30343  0.05902  5.14117  0.00000
black     -0.12828  0.14737 -0.87049  0.38462
white     -0.05872  0.14099 -0.41650  0.67730

table_white: 
                 b       se        t     pval
Intercept  1.47006  0.21856  6.72615  0.00000
sat        0.00114  0.00019  6.01360  0.00000
hsperc    -0.00857  0.00140 -6.10008  0.00000
tothrs     0.00250  0.00073  3.41365  0.00064
female     0.30343  0.05857  5.18073  0.00000
black     -0.12828  0.11810 -1.08627  0.27736
white     -0.05872  0.11032 -0.53228  0.59453

table_refined: 
                 b       se        t     pval
Intercept  1.47006  0.22938  6.40885  0.00000
sat        0.00114  0.00020  5.84017  0.00000
hsperc    -0.00857  0.00144 -5.93407  0.00000
tothrs     0.00250  0.00075  3.34177  0.00083
female     0.30343  0.06004  5.05388  0.00000
black     -0.12828  0.12819 -1.00074  0.31695
white     -0.05872  0.12044 -0.48758  0.62585



Script 8.2: Example-8-2-cont.py 
import wooldridge as woo
import statsmodels.formula.api as smf

gpa3 = woo.dataWoo('gpa3')

# definition of model and hypotheses:
reg = smf.ols(formula='cumgpa ~ sat + hsperc + tothrs + female + black + white',
              data=gpa3, subset=(gpa3['spring'] == 1))
hypotheses = ['black = 0', 'white = 0']

# F-Tests using different variance-covariance formulas:
# usual VCOV:
results_default = reg.fit()
ftest_default = results_default.f_test(hypotheses)
fstat_default = ftest_default.statistic[0][0]
fpval_default = ftest_default.pvalue
print(f'fstat_default: {fstat_default}\n')
print(f'fpval_default: {fpval_default}\n')

# refined White VCOV:
results_hc3 = reg.fit(cov_type='HC3')
ftest_hc3 = results_hc3.f_test(hypotheses)
fstat_hc3 = ftest_hc3.statistic[0][0]
fpval_hc3 = ftest_hc3.pvalue
print(f'fstat_hc3: {fstat_hc3}\n')
print(f'fpval_hc3: {fpval_hc3}\n')

# classical White VCOV:
results_hc0 = reg.fit(cov_type='HC0')
ftest_hc0 = results_hc0.f_test(hypotheses)
fstat_hc0 = ftest_hc0.statistic[0][0]
fpval_hc0 = ftest_hc0.pvalue
print(f'fstat_hc0: {fstat_hc0}\n')
print(f'fpval_hc0: {fpval_hc0}\n')


Output of Script 8.2: Example-8-2-cont.py 

fstat_default: 0.6796041956073398

fpval_default: 0.5074683622584049

fstat_hc3: 0.6724692957656673

fpval_hc3: 0.5110883633440992

fstat_hc0: 0.7477969818036272

fpval_hc0: 0.4741442714738484


8.2. Heteroscedasticity Tests 


The Breusch-Pagan (BP) test for heteroscedasticity is easy to implement with basic OLS routines. 
After a model 

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u \tag{8.2}$$

is estimated, we obtain the residuals $\hat{u}_i$ for all observations $i = 1, \ldots, n$. We regress their squared 
value on all independent variables from the original equation. We can either look at the standard F 
test of overall significance printed for example by the summary method, or we can use an LM test 
by multiplying the $R^2$ from the second regression with the number of observations. 

In statsmodels, this is easily done. Remember that the residuals from a regression are saved 
as resid in the result object that is returned by fit. Their squared value can be stored in a new 
variable to be used as a dependent variable in the second stage. 
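The two stages can be sketched without any regression package. The following example uses simulated data and plain NumPy; the data generating process and the helper ols_fit are assumptions made for illustration and not part of the original scripts:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.uniform(1, 5, size=(n, 2))])
# error standard deviation depends on the first regressor
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n) * X[:, 1]

def ols_fit(X, y):
    # return coefficients, residuals and R-squared of an OLS regression
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return b, resid, r2

# first stage: original regression
_, resid, _ = ols_fit(X, y)

# second stage: squared residuals on the same regressors
_, _, r2_aux = ols_fit(X, resid ** 2)

k = X.shape[1] - 1                                   # number of slope parameters
bp_lm = n * r2_aux                                   # LM version
bp_F = (r2_aux / k) / ((1 - r2_aux) / (n - k - 1))   # F version
print(f'bp_lm: {bp_lm}\n')
print(f'bp_F: {bp_F}\n')
```

Under the null hypothesis of homoscedasticity, the LM statistic is compared to a chi-squared distribution with k degrees of freedom and the F statistic to an F distribution with k and n-k-1 degrees of freedom.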

The LM version of the BP test is even more convenient to use with the statsmodels function 
stats.diagnostic.het_breuschpagan. It can be used directly as demonstrated in Script 8.3 
(Example-8-4.py) to compute the test statistic and corresponding p value. 


Wooldridge, Example 8.4: Heteroscedasticity in a Housing Price Equation 


Script 8.3 (Example-8-4.py) implements the F and LM versions of the BP test. The command 
stats.diagnostic.het_breuschpagan simply takes the regression residuals and the regressor matrix 
as arguments and delivers a test statistic of LM = 14.09. The corresponding p value is smaller than 
0.003 so we reject homoscedasticity for all reasonable significance levels. 

The output also shows the manual implementation of a second stage regression where we regress 
squared residuals on the independent variables. We can directly interpret the reported F statistic of 
5.34 and its p value of 0.002 as the F version of the BP test. We can manually calculate the LM statistic 
by multiplying the reported $R^2 = 0.16$ with the number of observations $n = 88$. 

We replicate the test for an alternative model with logarithms discussed by Wooldridge (2019) together 
with the White test in Example 8.5 and Script 8.4 (Example-8-5.py). 

Script 8.3: Example-8-4.py 
import wooldridge as woo
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy as pt

hprice1 = woo.dataWoo('hprice1')

# estimate model:
reg = smf.ols(formula='price ~ lotsize + sqrft + bdrms', data=hprice1)
results = reg.fit()
table_results = pd.DataFrame({'b': round(results.params, 4),
                              'se': round(results.bse, 4),
                              't': round(results.tvalues, 4),
                              'pval': round(results.pvalues, 4)})
print(f'table_results: \n{table_results}\n')

# automatic BP test (LM version):
y, X = pt.dmatrices('price ~ lotsize + sqrft + bdrms',
                    data=hprice1, return_type='dataframe')
result_bp_lm = sm.stats.diagnostic.het_breuschpagan(results.resid, X)
bp_lm_statistic = result_bp_lm[0]
bp_lm_pval = result_bp_lm[1]
print(f'bp_lm_statistic: {bp_lm_statistic}\n')
print(f'bp_lm_pval: {bp_lm_pval}\n')

# manual BP test (F version):
hprice1['resid_sq'] = results.resid ** 2
reg_resid = smf.ols(formula='resid_sq ~ lotsize + sqrft + bdrms', data=hprice1)
results_resid = reg_resid.fit()
bp_F_statistic = results_resid.fvalue
bp_F_pval = results_resid.f_pvalue
print(f'bp_F_statistic: {bp_F_statistic}\n')
print(f'bp_F_pval: {bp_F_pval}\n')


Output of Script 8.3: Example-8-4.py 


table_results: 
                 b       se       t    pval
Intercept -21.7703  29.4750 -0.7386  0.4622
lotsize     0.0021   0.0006  3.2201  0.0018
sqrft       0.1228   0.0132  9.2751  0.0000
bdrms      13.8525   9.0101  1.5374  0.1279

bp_lm_statistic: 14.092385504350272

bp_lm_pval: 0.002782059555689044

bp_F_statistic: 5.338919363241436

bp_F_pval: 0.002047744420936033


The White test is a variant of the BP test where in the second stage, we do not regress the squared 
first-stage residuals on the original regressors only. Instead, we add interactions and polynomials of 
them or include the fitted values $\hat{y}$ and $\hat{y}^2$. This can easily be done in a manual second-stage 
regression remembering that the fitted values are stored in the regression results object as fittedvalues. 
Conveniently, we can also use the stats.diagnostic.het_breuschpagan command to do 
the calculations of the LM version of the test including the p values automatically. All we have to do 
is to explain that in the second stage we want a different set of regressors. 
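A manual version of the variant with fitted values might look as follows. It is a sketch on simulated (homoscedastic) data with plain NumPy; the names and parameter values are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, -0.5, 0.2]) + rng.normal(size=n)

# first stage: fitted values and squared residuals
b = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ b
u2 = (y - fitted) ** 2

# second-stage regressors: constant, fitted values and their squares
Z = np.column_stack([np.ones(n), fitted, fitted ** 2])
g = np.linalg.lstsq(Z, u2, rcond=None)[0]
r2_aux = 1 - np.sum((u2 - Z @ g) ** 2) / np.sum((u2 - u2.mean()) ** 2)

# LM statistic, compared to a chi-squared distribution with 2 df
white_lm = n * r2_aux
print(f'white_lm: {white_lm}\n')
```

An advantage of this variant is that the second stage always has two regressors besides the constant, no matter how many regressors the original model contains.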


Wooldridge, Example 8.5: BP and White test in the Log Housing Price Equation 


Script 8.4 (Example-8-5.py) implements the BP and the White test for a model that now contains 
logarithms of the dependent variable and two independent variables. The LM versions of both the BP 
and the White test do not reject the null hypothesis at conventional significance levels with p values of 
0.238 and 0.178, respectively. 


Script 8.4: Example-8-5.py 
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy as pt

hprice1 = woo.dataWoo('hprice1')

# estimate model:
reg = smf.ols(formula='np.log(price) ~ np.log(lotsize) + np.log(sqrft) + bdrms',
              data=hprice1)
results = reg.fit()

# BP test:
y, X_bp = pt.dmatrices('np.log(price) ~ np.log(lotsize) + np.log(sqrft) + bdrms',
                       data=hprice1, return_type='dataframe')
result_bp = sm.stats.diagnostic.het_breuschpagan(results.resid, X_bp)
bp_statistic = result_bp[0]
bp_pval = result_bp[1]
print(f'bp_statistic: {bp_statistic}\n')
print(f'bp_pval: {bp_pval}\n')

# White test:
X_wh = pd.DataFrame({'const': 1, 'fitted_reg': results.fittedvalues,
                     'fitted_reg_sq': results.fittedvalues ** 2})
result_white = sm.stats.diagnostic.het_breuschpagan(results.resid, X_wh)
white_statistic = result_white[0]
white_pval = result_white[1]
print(f'white_statistic: {white_statistic}\n')
print(f'white_pval: {white_pval}\n')


Output of Script 8.4: Example-8-5.py 

bp_statistic: 4.223245741805286

bp_pval: 0.23834482631493

white_statistic: 3.4472865468750253

white_pval: 0.1784149479413317

8.3. Weighted Least Squares 


Weighted Least Squares (WLS) attempts to provide a more efficient alternative to OLS. It is a special 
version of a feasible generalized least squares (FGLS) estimator. Instead of the sum of squared 
residuals, their weighted sum is minimized. If the weights are inversely proportional to the variance, 
the estimator is efficient. Also the usual formula for the variance-covariance matrix of the parameter 
estimates and standard inference tools are valid. 

We can obtain WLS parameter estimates by multiplying each variable in the model with the square 
root of the weight as shown by Wooldridge (2019, Section 8.4). In statsmodels, it is more 
convenient to use the option weights=... of the command wls. This provides a more concise syntax 
and takes care of correct residuals, fitted values, predictions, and the like in terms of the 
original variables. In terms of methods and arguments, wls is very similar to the function ols. 
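The transformation approach can be checked against the closed-form WLS estimator in a small sketch. It uses simulated data and plain NumPy; the variable inc and the parameter values are made up for illustration and are not the 401ksubs data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
inc = rng.uniform(10, 100, size=n)   # simulated regressor
# error variance proportional to inc
y = 5.0 + 0.8 * inc + rng.normal(size=n) * np.sqrt(inc)
X = np.column_stack([np.ones(n), inc])
w = 1 / inc                          # weights inversely proportional to the variance

# (a) OLS on variables multiplied by sqrt(w), including the constant
sw = np.sqrt(w)
b_transformed = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]

# (b) closed-form WLS: (X'WX)^-1 X'Wy
b_wls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y * w))

print(b_transformed, b_wls)
```

Both approaches give the same coefficients. Note that the constant has to be multiplied by the square root of the weight as well, so the transformed regression does not contain a conventional intercept.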


Wooldridge, Example 8.6: Financial Wealth Equation 


Script 8.5 (Example-8-6.py) implements both OLS and WLS estimation for a regression of financial 
wealth (nettfa) on income (inc), age (age), gender (male) and eligibility for a pension plan (e401k) 
using the data set 401ksubs. Following Wooldridge (2019), we assume that the variance is proportional 
to the income variable inc. Therefore, the optimal weight is 1/inc which is given as wls_weight in the 
wls call. 


Script 8.5: Example-8-6.py 
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

k401ksubs = woo.dataWoo('401ksubs')

# subsetting data:
k401ksubs_sub = k401ksubs[k401ksubs['fsize'] == 1]

# OLS (only for singles, i.e. fsize==1):
reg_ols = smf.ols(formula='nettfa ~ inc + I((age-25)**2) + male + e401k',
                  data=k401ksubs_sub)
results_ols = reg_ols.fit(cov_type='HC0')

# print regression table:
table_ols = pd.DataFrame({'b': round(results_ols.params, 4),
                          'se': round(results_ols.bse, 4),
                          't': round(results_ols.tvalues, 4),
                          'pval': round(results_ols.pvalues, 4)})
print(f'table_ols: \n{table_ols}\n')

# WLS:
wls_weight = list(1 / k401ksubs_sub['inc'])
reg_wls = smf.wls(formula='nettfa ~ inc + I((age-25)**2) + male + e401k',
                  weights=wls_weight, data=k401ksubs_sub)
results_wls = reg_wls.fit()

# print regression table:
table_wls = pd.DataFrame({'b': round(results_wls.params, 4),
                          'se': round(results_wls.bse, 4),
                          't': round(results_wls.tvalues, 4),
                          'pval': round(results_wls.pvalues, 4)})
print(f'table_wls: \n{table_wls}\n')

Output of Script 8.5: Example-8-6.py 

table_ols: 
                          b      se        t    pval
Intercept          -20.9850  3.4909  -6.0114  0.0000
inc                  0.7706  0.0994   7.7486  0.0000
I((age - 25) ** 2)   0.0251  0.0043   5.7912  0.0000
male                 2.4779  2.0558   1.2053  0.2281
e401k                6.8862  2.2837   3.0153  0.0026

table_wls: 
                          b      se        t    pval
Intercept          -16.7025  1.9580  -8.5304  0.0000
inc                  0.7404  0.0643  11.5140  0.0000
I((age - 25) ** 2)   0.0175  0.0019   9.0796  0.0000
male                 1.8405  1.5636   1.1771  0.2393
e401k                5.1883  1.7034   3.0458  0.0024


We can also use heteroscedasticity-robust statistics from Section 8.1 to account for the fact that our 
variance function might be misspecified. Script 8.6 (WLS-Robust.py) repeats the WLS estimation 
of Example 8.6 but reports non-robust and robust standard errors and t statistics. It replicates 
Wooldridge (2019, Table 8.2) with the only difference that we use a refined version of the robust 
SE formula. There is nothing special about the implementation. The fact that we used weights is 
correctly accounted for in the following calculations. 


Script 8.6: WLS-Robust.py 
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

k401ksubs = woo.dataWoo('401ksubs')

# subsetting data:
k401ksubs_sub = k401ksubs[k401ksubs['fsize'] == 1]

# WLS:
wls_weight = list(1 / k401ksubs_sub['inc'])
reg_wls = smf.wls(formula='nettfa ~ inc + I((age-25)**2) + male + e401k',
                  weights=wls_weight, data=k401ksubs_sub)

# non-robust (default) results:
results_wls = reg_wls.fit()
table_default = pd.DataFrame({'b': round(results_wls.params, 4),
                              'se': round(results_wls.bse, 4),
                              't': round(results_wls.tvalues, 4),
                              'pval': round(results_wls.pvalues, 4)})
print(f'table_default: \n{table_default}\n')

# robust results (refined White SE):
results_white = reg_wls.fit(cov_type='HC3')
table_white = pd.DataFrame({'b': round(results_white.params, 4),
                            'se': round(results_white.bse, 4),
                            't': round(results_white.tvalues, 4),
                            'pval': round(results_white.pvalues, 4)})
print(f'table_white: \n{table_white}\n')

Output of Script 8.6: WLS-Robust.py 

table_default: 
                          b      se        t    pval
Intercept          -16.7025  1.9580  -8.5304  0.0000
inc                  0.7404  0.0643  11.5140  0.0000
I((age - 25) ** 2)   0.0175  0.0019   9.0796  0.0000
male                 1.8405  1.5636   1.1771  0.2393
e401k                5.1883  1.7034   3.0458  0.0024

table_white: 
                          b      se        t    pval
Intercept          -16.7025  2.2482  -7.4292  0.0000
inc                  0.7404  0.0752   9.8403  0.0000
I((age - 25) ** 2)   0.0175  0.0026   6.7650  0.0000
male                 1.8405  1.3132   1.4015  0.1611
e401k                5.1883  1.5743   3.2955  0.0010



The assumption made in Example 8.6 that the variance is proportional to a regressor is usually 
hard to justify. Typically, we do not know the variance function and have to estimate it. This 
feasible GLS (FGLS) estimator replaces the (allegedly) known variance function with an estimated 
one. 

We can estimate the relation between variance and regressors using a linear regression with the log of 
the squared residuals from an initial OLS regression, $\log(\hat{u}^2)$, as the dependent variable. Wooldridge 
(2019, Section 8.4) suggests two versions for the selection of regressors: 

* the regressors $x_1, \ldots, x_k$ from the original model similar to the BP test 

* $\hat{y}$ and $\hat{y}^2$ from the original model similar to the White test 

As the estimated error variance, we can use $\exp\big(\widehat{\log(\hat{u}^2)}\big)$, the exponential of the 
fitted values from this regression. Its inverse can then be used as a weight in WLS estimation. 
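The steps of this procedure can be summarized in a compact sketch. It uses the first version (original regressors) on simulated data with plain NumPy; the data generating process and all names are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x = rng.uniform(0, 3, size=n)
X = np.column_stack([np.ones(n), x])
# heteroscedastic errors: standard deviation exp(0.5 * x)
y = 1.0 + 2.0 * x + rng.normal(size=n) * np.exp(0.5 * x)

# step 1: initial OLS regression and residuals
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b_ols

# step 2: regress log squared residuals on the original regressors
logu2 = np.log(resid ** 2)
g = np.linalg.lstsq(X, logu2, rcond=None)[0]

# step 3: estimated variance function and FGLS weights
h_hat = np.exp(X @ g)
w = 1 / h_hat

# step 4: WLS with the estimated weights
b_fgls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y * w))
print(b_ols, b_fgls)
```

Because of the exponential function, the estimated variances are positive by construction, so the weights are always well defined.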


Wooldridge, Example 8.7: Demand for Cigarettes 


Script 8.7 (Example-8-7.py) studies the relationship between daily cigarette consumption cigs, 
individual characteristics, and restaurant smoking restrictions restaurn. After the initial OLS regression, a 
BP test is performed which clearly rejects homoscedasticity (see previous section for the BP test). After 
the regression of log squared residuals on the regressors, the FGLS weights are calculated and used in 
the WLS regression. See Wooldridge (2019) for a discussion of the results. 

Script 8.7: Example-8-7.py 
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy as pt

smoke = woo.dataWoo('smoke')

# OLS:
reg_ols = smf.ols(formula='cigs ~ np.log(income) + np.log(cigpric) +'
                          'educ + age + I(age**2) + restaurn',
                  data=smoke)
results_ols = reg_ols.fit()
table_ols = pd.DataFrame({'b': round(results_ols.params, 4),
                          'se': round(results_ols.bse, 4),
                          't': round(results_ols.tvalues, 4),
                          'pval': round(results_ols.pvalues, 4)})
print(f'table_ols: \n{table_ols}\n')

# BP test:
y, X = pt.dmatrices('cigs ~ np.log(income) + np.log(cigpric) + educ +'
                    'age + I(age**2) + restaurn',
                    data=smoke, return_type='dataframe')
result_bp = sm.stats.diagnostic.het_breuschpagan(results_ols.resid, X)
bp_statistic = result_bp[0]
bp_pval = result_bp[1]
print(f'bp_statistic: {bp_statistic}\n')
print(f'bp_pval: {bp_pval}\n')

# FGLS (estimation of the variance function):
smoke['logu2'] = np.log(results_ols.resid ** 2)
reg_fgls = smf.ols(formula='logu2 ~ np.log(income) + np.log(cigpric) +'
                           'educ + age + I(age**2) + restaurn', data=smoke)
results_fgls = reg_fgls.fit()
table_fgls = pd.DataFrame({'b': round(results_fgls.params, 4),
                           'se': round(results_fgls.bse, 4),
                           't': round(results_fgls.tvalues, 4),
                           'pval': round(results_fgls.pvalues, 4)})
print(f'table_fgls: \n{table_fgls}\n')

# FGLS (WLS):
wls_weight = list(1 / np.exp(results_fgls.fittedvalues))
reg_wls = smf.wls(formula='cigs ~ np.log(income) + np.log(cigpric) +'
                          'educ + age + I(age**2) + restaurn',
                  weights=wls_weight, data=smoke)
results_wls = reg_wls.fit()
table_wls = pd.DataFrame({'b': round(results_wls.params, 4),
                          'se': round(results_wls.bse, 4),
                          't': round(results_wls.tvalues, 4),
                          'pval': round(results_wls.pvalues, 4)})
print(f'table_wls: \n{table_wls}\n')

Output of Script 8.7: Example-8-7.py 

table_ols: 
                      b       se       t    pval
Intercept       -3.6398  24.0787 -0.1512  0.8799
np.log(income)   0.8803   0.7278  1.2095  0.2268
np.log(cigpric) -0.7509   5.7733 -0.1301  0.8966
educ            -0.5015   0.1671 -3.0016  0.0028
age              0.7707   0.1601  4.8132  0.0000
I(age ** 2)     -0.0090   0.0017 -5.1765  0.0000
restaurn        -2.8251   1.1118 -2.5410  0.0112

bp_statistic: 32.25841908120121

bp_pval: 1.4557794830278942e-05

table_fgls: 
                      b      se        t    pval
Intercept       -1.9207  2.5630  -0.7494  0.4538
np.log(income)   0.2915  0.0775   3.7634  0.0002
np.log(cigpric)  0.1954  0.6145   0.3180  0.7506
educ            -0.0797  0.0178  -4.4817  0.0000
age              0.2040  0.0170  11.9693  0.0000
I(age ** 2)     -0.0024  0.0002 -12.8931  0.0000
restaurn        -0.6270  0.1183  -5.2982  0.0000

table_wls: 
                      b       se       t    pval
Intercept        5.6355  17.8031  0.3165  0.7517
np.log(income)   1.2952   0.4370  2.9639  0.0031
np.log(cigpric) -2.9403   4.4601 -0.6592  0.5099
educ            -0.4634   0.1202 -3.8570  0.0001
age              0.4819   0.0968  4.9784  0.0000
I(age ** 2)     -0.0056   0.0009 -5.9897  0.0000
restaurn        -3.4611   0.7955 -4.3508  0.0000


9. More on Specification and Data Issues 


This chapter covers different topics of model specification and data problems. Section 9.1 asks how 
statistical tests can help us specify the "correct" functional form given the numerous options we 
have seen in Chapters 6 and 7. Section 9.2 shows some simulation results regarding the effects of 
measurement errors in dependent and independent variables. Section 9.3 covers missing values and 
how Python can deal with them. In Section 9.4, we briefly discuss outliers and in Section 9.5, the LAD 
estimator is presented. 


9.1. Functional Form Misspecification 


We have seen many ways to flexibly specify the relation between the dependent variable and the 
regressors. An obvious question to ask is whether or not a given specification is the “correct” 
one. The Regression Equation Specification Error Test (RESET) is a convenient tool to test the null 
hypothesis that the functional form is adequate. 

Wooldridge (2019, Section 9.1) shows how to implement it using a standard F test in a second 
regression that contains polynomials of fitted values from the original regression. We already know 
how to obtain fitted values and run an F test, so the implementation is straightforward. Even more 
convenient is the boxed routine reset_ramsey from the module statsmodels. We just have 
to supply the regression we want to test (argument res) and the order of included polynomials 
(argument degree) and the rest is done automatically. 
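The logic of the manual approach can be sketched with simulated data before turning to the housing price example. The following code uses plain NumPy and a data generating process that is quadratic by construction; the helper ssr and all names are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 4, size=n)
# the true relation is quadratic, so a linear model is misspecified
y = 1.0 + x + 0.5 * x ** 2 + rng.normal(scale=0.5, size=n)

def ssr(X, y):
    # sum of squared residuals and fitted values of an OLS regression
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ b
    return np.sum((y - fitted) ** 2), fitted

# restricted model: linear in x
X_r = np.column_stack([np.ones(n), x])
ssr_r, fitted = ssr(X_r, y)

# unrestricted model: add squared and cubed fitted values
X_ur = np.column_stack([X_r, fitted ** 2, fitted ** 3])
ssr_ur, _ = ssr(X_ur, y)

q = 2                        # number of added terms
df = n - X_ur.shape[1]       # residual degrees of freedom
reset_F = ((ssr_r - ssr_ur) / q) / (ssr_ur / df)
print(f'reset_F: {reset_F}\n')
```

Since the linear model ignores the quadratic term, the added polynomials of the fitted values reduce the sum of squared residuals substantially and the F statistic is large. For a correctly specified model, we would expect a value in the usual range of an F distribution with q and df degrees of freedom.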


Wooldridge, Example 9.2: Housing Price Equation 


Script 9.1 (Example-9-2-manual.py) implements the RESET test using the procedure described by 
Wooldridge (2019) for the housing price model. As previously, we get the fitted values from the original 
regression using fittedvalues. Their polynomials are entered into the formula of the second regression. 
The F test is easily done using f_test as described in Section 4.3. 

The same results are obtained more conveniently using the command reset_ramsey in Script 9.2 
(Example-9-2-automatic.py). Both implementations deliver the same results: The test statistic is 
F = 4.67 with a p value of p = 0.012, so we reject the null hypothesis that this equation is correctly 
specified at a significance level of α = 5%. 

Script 9.1: Example-9-2-manual.py 
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

hprice1 = woo.dataWoo('hprice1')

# original OLS:
reg = smf.ols(formula='price ~ lotsize + sqrft + bdrms', data=hprice1)
results = reg.fit()

# regression for RESET test:
hprice1['fitted_sq'] = results.fittedvalues ** 2
hprice1['fitted_cub'] = results.fittedvalues ** 3
reg_reset = smf.ols(formula='price ~ lotsize + sqrft + bdrms +'
                            ' fitted_sq + fitted_cub', data=hprice1)
results_reset = reg_reset.fit()

# print regression table:
table = pd.DataFrame({'b': round(results_reset.params, 4),
                      'se': round(results_reset.bse, 4),
                      't': round(results_reset.tvalues, 4),
                      'pval': round(results_reset.pvalues, 4)})
print(f'table: \n{table}\n')

# RESET test (H0: all coeffs including "fitted" are=0):
hypotheses = ['fitted_sq = 0', 'fitted_cub = 0']
ftest_man = results_reset.f_test(hypotheses)
fstat_man = ftest_man.statistic[0][0]
fpval_man = ftest_man.pvalue

print(f'fstat_man: {fstat_man}\n')
print(f'fpval_man: {fpval_man}\n')


Output of Script 9.1: Example-9-2-manual.py
table:
                   b        se       t    pval
Intercept   166.0973  317.4325  0.5233  0.6022
lotsize       0.0002    0.0052  0.0295  0.9765
sqrft         0.0176    0.2993  0.0588  0.9532
bdrms         2.1749   33.8881  0.0642  0.9490
fitted_sq     0.0004    0.0071  0.0498  0.9604
fitted_cub    0.0000    0.0000  0.2358  0.8142

fstat_man: 4.668205534950367

fpval_man: 0.012021711442865948
Script 9.2: Example-9-2-automatic.py
import wooldridge as woo
import statsmodels.formula.api as smf
import statsmodels.stats.outliers_influence as smo

hprice1 = woo.dataWoo('hprice1')

# original linear regression:
reg = smf.ols(formula='price ~ lotsize + sqrft + bdrms', data=hprice1)
results = reg.fit()

# automated RESET test:
reset_output = smo.reset_ramsey(res=results, degree=3)
fstat_auto = reset_output.statistic[0][0]
fpval_auto = reset_output.pvalue

print(f'fstat_auto: {fstat_auto}\n')
print(f'fpval_auto: {fpval_auto}\n')


Output of Script 9.2: Example-9-2-automatic.py
fstat_auto: 4.668205534948779

fpval_auto: 0.012021711442883014


Wooldridge (2019, Section 9.1-b) also discusses tests of non-nested models. As an example, a test
of both models against a comprehensive model containing all regressors is mentioned. Such a test
can be implemented in statsmodels by the command anova_lm that we already discussed. Script
9.3 (Nonnested-Test.py) shows this test in action for a modified version of Example 9.2.

The two alternative models for the housing price are

$$price = \beta_0 + \beta_1 lotsize + \beta_2 sqrft + \beta_3 bdrms + u, \qquad (9.1)$$
$$price = \beta_0 + \beta_1 \log(lotsize) + \beta_2 \log(sqrft) + \beta_3 bdrms + u. \qquad (9.2)$$

The output shows the test results of testing both models against the encompassing model with all
variables. Both models are rejected against this comprehensive model.


Script 9.3: Nonnested-Test.py
import wooldridge as woo
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

hprice1 = woo.dataWoo('hprice1')

# two alternative models:
reg1 = smf.ols(formula='price ~ lotsize + sqrft + bdrms', data=hprice1)
results1 = reg1.fit()

reg2 = smf.ols(formula='price ~ np.log(lotsize) +'
                       'np.log(sqrft) + bdrms', data=hprice1)
results2 = reg2.fit()

# encompassing test of Davidson & MacKinnon:
# comprehensive model:
reg3 = smf.ols(formula='price ~ lotsize + sqrft + bdrms + '
                       'np.log(lotsize) + np.log(sqrft)', data=hprice1)
results3 = reg3.fit()

# model 1 vs. comprehensive model:
anovaResults1 = sm.stats.anova_lm(results1, results3)
print(f'anovaResults1: \n{anovaResults1}\n')

# model 2 vs. comprehensive model:
anovaResults2 = sm.stats.anova_lm(results2, results3)
print(f'anovaResults2: \n{anovaResults2}\n')


Output of Script 9.3: Nonnested-Test.py
anovaResults1:
   df_resid            ssr  df_diff       ss_diff         F    Pr(>F)
0      84.0  300723.805123      0.0           NaN       NaN       NaN
1      82.0  252340.364481      2.0  48383.440642  7.861291  0.000753

anovaResults2:
   df_resid            ssr  df_diff       ss_diff        F    Pr(>F)
0      84.0  295735.273607      0.0           NaN      NaN       NaN
1      82.0  252340.364481      2.0  43394.909126  7.05076  0.001494
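The F statistic in such a table is the usual comparison of restricted and unrestricted sums of squared residuals. As a hedged sketch on simulated data (the variable names x1, x2, y and the data generating process are hypothetical stand-ins, not the housing data), the same comparison can be obtained with the results method compare_f_test and reproduced by hand:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulated data (hypothetical stand-in for a real data set)
rng = np.random.default_rng(2024)
n = 100
df = pd.DataFrame({'x1': rng.uniform(1, 10, size=n),
                   'x2': rng.uniform(1, 10, size=n)})
df['y'] = 1 + 0.5 * df['x1'] + np.log(df['x2']) + rng.normal(size=n)

# model in levels and comprehensive model with levels and logs
res_levels = smf.ols('y ~ x1 + x2', data=df).fit()
res_all = smf.ols('y ~ x1 + x2 + np.log(x1) + np.log(x2)', data=df).fit()

# F test of the restricted (levels) model against the comprehensive model
f_value, p_value, df_diff = res_all.compare_f_test(res_levels)

# the same F statistic from the sums of squared residuals
f_manual = ((res_levels.ssr - res_all.ssr) / df_diff) / (res_all.ssr / res_all.df_resid)
print(f'f_value: {f_value}, p_value: {p_value}, f_manual: {f_manual}')
```

The two restrictions tested are the coefficients of the two log terms, which is why the difference in degrees of freedom equals 2.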


9.2. Measurement Error 


If a variable is not measured accurately, the consequences depend on whether the measurement 
error affects the dependent or an explanatory variable. If the dependent variable is mismeasured, 
the consequences can be mild. If the measurement error is unrelated to the regressors, the parameter 
estimates get less precise, but they are still consistent and the usual inferences from the results are 
valid. 

The simulation exercise in Script 9.4 (Sim-ME-Dep.py) draws 10000 samples of size n = 1000
according to the model with measurement error in the dependent variable

$$y^* = \beta_0 + \beta_1 x + u, \qquad y = y^* + e_0. \qquad (9.3)$$

The assumption is that we do not observe the true values of the dependent variable $y^*$ but our
measure $y$ is contaminated with a measurement error $e_0$.
Script 9.4: Sim-ME-Dep.py
import numpy as np
import scipy.stats as stats
import pandas as pd
import statsmodels.formula.api as smf

# set the random seed:
np.random.seed(1234567)

# set sample size and number of simulations:
n = 1000
r = 10000

# set true parameters (betas):
beta0 = 1
beta1 = 0.5

# initialize arrays to store results later (b1 without ME, b1_me with ME):
b1 = np.empty(r)
b1_me = np.empty(r)

# draw a sample of x, fixed over replications:
x = stats.norm.rvs(4, 1, size=n)

# repeat r times:
for i in range(r):
    # draw a sample of u:
    u = stats.norm.rvs(0, 1, size=n)

    # draw a sample of ystar:
    ystar = beta0 + beta1 * x + u

    # measurement error and mismeasured y:
    e0 = stats.norm.rvs(0, 1, size=n)
    y = ystar + e0
    df = pd.DataFrame({'ystar': ystar, 'y': y, 'x': x})

    # regress ystar on x and store slope estimate at position i:
    reg_star = smf.ols(formula='ystar ~ x', data=df)
    results_star = reg_star.fit()
    b1[i] = results_star.params['x']

    # regress y on x and store slope estimate at position i:
    reg_me = smf.ols(formula='y ~ x', data=df)
    results_me = reg_me.fit()
    b1_me[i] = results_me.params['x']

# mean with and without ME:
b1_mean = np.mean(b1)
b1_me_mean = np.mean(b1_me)
print(f'b1_mean: {b1_mean}\n')
print(f'b1_me_mean: {b1_me_mean}\n')

# variance with and without ME:
b1_var = np.var(b1, ddof=1)
b1_me_var = np.var(b1_me, ddof=1)
print(f'b1_var: {b1_var}\n')
print(f'b1_me_var: {b1_me_var}\n')
Output of Script 9.4: Sim-ME-Dep.py
b1_mean: 0.5002159846382418

b1_me_mean: 0.4999676458235338

b1_var: 0.0010335543409510668

b1_me_var: 0.0020439380493408005


In the simulation, the parameter estimates using both the correct $y^*$ and the mismeasured $y$ are
stored as the variables b1 and b1_me, respectively. As expected, the simulated mean of both variables
is close to the expected value of $\beta_1 = 0.5$. Output 9.4 (Sim-ME-Dep.py) shows that the
variance of b1_me is around 0.002 which is twice as high as the variance of b1. This was expected
since in our simulation, $u$ and $e_0$ are both independent standard normal variables, so $\mathrm{Var}(u) = 1$ and
$\mathrm{Var}(u + e_0) = 2$.
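The factor of two can also be read off the textbook variance formula of the OLS slope. A short derivation, conditional on the fixed regressor values and assuming, as in the simulation, that $u$ and $e_0$ are independent with unit variance (the superscript ME is notation introduced only for this sketch):

```latex
\begin{aligned}
y &= \beta_0 + \beta_1 x + (u + e_0), \\
\mathrm{Var}(\hat\beta_1^{ME} \mid x)
  &= \frac{\mathrm{Var}(u + e_0)}{\sum_{i=1}^n (x_i - \bar x)^2}
   = \frac{\mathrm{Var}(u) + \mathrm{Var}(e_0)}{\sum_{i=1}^n (x_i - \bar x)^2}, \\
\frac{\mathrm{Var}(\hat\beta_1^{ME} \mid x)}{\mathrm{Var}(\hat\beta_1 \mid x)}
  &= \frac{\mathrm{Var}(u) + \mathrm{Var}(e_0)}{\mathrm{Var}(u)} = \frac{1 + 1}{1} = 2 .
\end{aligned}
```

With n = 1000 and a regressor variance of about 1, the denominator is roughly 1000, which matches the simulated variances of about 0.001 and 0.002.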

If an explanatory variable is mismeasured, the consequences are usually more dramatic. Even in 
the classical errors-in-variables case where the measurement error is unrelated to the regressors, the 
parameter estimates are biased and inconsistent. This model is 


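$$y = \beta_0 + \beta_1 x^* + u, \qquad x = x^* + e_1, \qquad (9.4)$$

where the measurement error $e_1$ is independent of both $x^*$ and $u$. Wooldridge (2019, Section 9.4)
shows that if we regress $y$ on $x$ instead of $x^*$,

$$\mathrm{plim}\,\hat\beta_1 = \beta_1 \cdot \frac{\mathrm{Var}(x^*)}{\mathrm{Var}(x^*) + \mathrm{Var}(e_1)}. \qquad (9.5)$$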


The simulation in Script 9.5 (Sim-ME-Explan.py) draws 10000 samples of size n = 1000 from this 
model. 
Script 9.5: Sim-ME-Explan.py
import numpy as np
import scipy.stats as stats
import pandas as pd
import statsmodels.formula.api as smf

# set the random seed:
np.random.seed(1234567)

# set sample size and number of simulations:
n = 1000
r = 10000

# set true parameters (betas):
beta0 = 1
beta1 = 0.5

# initialize b1 arrays to store results later:
b1 = np.empty(r)
b1_me = np.empty(r)

# draw a sample of x, fixed over replications:
xstar = stats.norm.rvs(4, 1, size=n)

# repeat r times:
for i in range(r):
    # draw a sample of u:
    u = stats.norm.rvs(0, 1, size=n)

    # draw a sample of y:
    y = beta0 + beta1 * xstar + u

    # measurement error and mismeasured x:
    e1 = stats.norm.rvs(0, 1, size=n)
    x = xstar + e1
    df = pd.DataFrame({'y': y, 'xstar': xstar, 'x': x})

    # regress y on xstar and store slope estimate at position i:
    reg_star = smf.ols(formula='y ~ xstar', data=df)
    results_star = reg_star.fit()
    b1[i] = results_star.params['xstar']

    # regress y on x and store slope estimate at position i:
    reg_me = smf.ols(formula='y ~ x', data=df)
    results_me = reg_me.fit()
    b1_me[i] = results_me.params['x']

# mean with and without ME:
b1_mean = np.mean(b1)
b1_me_mean = np.mean(b1_me)
print(f'b1_mean: {b1_mean}\n')
print(f'b1_me_mean: {b1_me_mean}\n')

# variance with and without ME:
b1_var = np.var(b1, ddof=1)
b1_me_var = np.var(b1_me, ddof=1)
print(f'b1_var: {b1_var}\n')
print(f'b1_me_var: {b1_me_var}\n')
Output of Script 9.5: Sim-ME-Explan.py
b1_mean: 0.5002159846382418

b1_me_mean: 0.2445467197788616

b1_var: 0.0010335543409510668

b1_me_var: 0.0005435611029837354


Since in this simulation, $\mathrm{Var}(x^*) = \mathrm{Var}(e_1) = 1$, Equation 9.5 implies that $\mathrm{plim}\,\hat\beta_1 = \frac{1}{2}\beta_1 = 0.25$.
This is confirmed by the simulation results in Output 9.5 (Sim-ME-Explan.py). While the mean
of the estimates in b1 using the correct regressor again is around 0.5, the mean parameter estimate
using the mismeasured regressor is about 0.25.
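The attenuation factor can also be checked without a replication loop by drawing one very large sample, in which the OLS slope should be close to its probability limit. This is only a sketch under the same assumptions as the simulation above (normal draws with unit variances); the sample size and the use of numpy's default_rng generator are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1234567)
n = 200_000
beta0, beta1 = 1.0, 0.5

xstar = rng.normal(4, 1, size=n)      # true regressor, Var(x*) = 1
e1 = rng.normal(0, 1, size=n)         # measurement error, Var(e1) = 1
x = xstar + e1                        # observed, mismeasured regressor
y = beta0 + beta1 * xstar + rng.normal(0, 1, size=n)

# OLS slope of y on x: sample covariance over sample variance
slope_me = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# probability limit implied by Equation 9.5
plim = beta1 * 1.0 / (1.0 + 1.0)
print(f'slope_me: {slope_me}, plim: {plim}')
```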


9.3. Missing Data and Nonrandom Samples 


In many data sets, we fail to observe all variables for each observational unit. An important case
is survey data where the respondents refuse or fail to answer some questions. We use numpy to
account for missing data by using its special value nan (not a number). It indicates that we do not
have the information or the value is not defined. The latter is usually the result of operations like $\frac{0}{0}$
or the logarithm of a negative number.

The function isnan(value) returns True if value is nan and False otherwise. Note that
operations resulting in $\pm\infty$ like $\log(0)$ or $\frac{1}{0}$ are not coded as nan but as inf or -inf. Script 9.6
(NA-NaN-Inf.py) gives some examples.


Script 9.6: NA-NaN-Inf.py
import numpy as np
import pandas as pd
import scipy.stats as stats

# nan and inf handling in numpy:
x = np.array([-1, 0, 1, np.nan, np.inf, -np.inf])
logx = np.log(x)
invx = np.array(1 / x)
ncdf = np.array(stats.norm.cdf(x))
isnanx = np.isnan(x)

results = pd.DataFrame({'x': x, 'logx': logx, 'invx': invx,
                        'ncdf': ncdf, 'isnanx': isnanx})
print(f'results: \n{results}\n')


Output of Script 9.6: NA-NaN-Inf.py
results:
     x  logx  invx      ncdf  isnanx
0 -1.0   NaN  -1.0  0.158655   False
1  0.0  -inf   inf  0.500000   False
2  1.0   0.0   1.0  0.841345   False
3  NaN   NaN   NaN       NaN    True
4  inf   inf   0.0  1.000000   False
5 -inf   NaN  -0.0  0.000000   False
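One reason a dedicated function such as isnan is needed is that nan never compares equal to anything, not even to itself. The following small sketch (toy values chosen for this illustration) contrasts a comparison with == against isnan, isinf, and isfinite:

```python
import numpy as np

x = np.array([1.0, np.nan, np.inf])

# nan never compares equal to anything, including itself
eq_nan = (x == np.nan)
# isnan flags only the undefined value, isinf only the infinite one
is_nan = np.isnan(x)
is_inf = np.isinf(x)
# isfinite is False for both nan and inf
is_finite = np.isfinite(x)
print(eq_nan, is_nan, is_inf, is_finite)
```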


Depending on the data source, real-world data sets can have different rules for indicating missing 
information. Sometimes, impossible numeric values are used. For example, a survey including the
number of years of education as a variable educ might have a value like “9999” to indicate missing
information. For any software package, it is highly recommended to change these to proper missing- 
value codes early in the data-handling process. Otherwise, we take the risk that some statistical 
method interprets those values as “this person went to school for 9999 years” producing highly 
nonsensical results. For the education example, if the variable educ is in the data frame mydata this 
can be done with 

mydata.loc[mydata['educ'] == 9999, 'educ'] = np.nan 
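A self-contained sketch of this recoding step, using a made-up data frame (the values of educ and wage are hypothetical), shows how strongly the impossible code distorts a simple mean and that the replace method is an equivalent alternative. The column is converted to float first, which is an assumption of this sketch to make sure it can hold nan.

```python
import numpy as np
import pandas as pd

# hypothetical survey data: 9999 marks a missing answer
mydata = pd.DataFrame({'educ': [12, 9999, 16, 9999, 10],
                       'wage': [15.5, 20.0, 31.0, 12.5, 9.0]})

mean_raw = mydata['educ'].mean()     # distorted by the 9999 codes

# variant 1: assignment with loc, as in the text
recoded = mydata.copy()
recoded['educ'] = recoded['educ'].astype(float)
recoded.loc[recoded['educ'] == 9999, 'educ'] = np.nan

# variant 2: the replace method gives the same result
replaced = mydata['educ'].astype(float).replace(9999, np.nan)

mean_clean = recoded['educ'].mean()  # missing values are skipped
print(f'mean_raw: {mean_raw}, mean_clean: {mean_clean}')
```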


We can also create Boolean variables indicating missing values using the pandas method isna.
For example, mydata['educ'].isna() will generate a Boolean variable of the same length which
is True whenever mydata['educ'] is np.nan. It can also be used on data frames. The command
mydata.isna() will return another data frame with the same dimensions and variable names but
full of Boolean variables for missing observations. It is useful to count the missings for each variable
in a data frame with

missings = mydata.isna()
missings.sum(axis=0)

The argument axis=0 makes sure that summing over observations is done for each variable, and
since an observation in this case is True (treated as 1 by sum) or False (treated as 0 by sum) this
gives the total number of missing values per variable. Following the same idea, axis=1 can be used
to identify observations with no missing variables. Script 9.7 (Missings.py) demonstrates these
commands for the data set LAWSCH85 which contains data on law schools. Of the 156 schools, 6 do
not report median LSAT scores. Looking at all variables, the most missings are found for the age of
the school: we don't know it for 45 schools. For only 90 of the 156 schools, we have the full set of
variables; for the other 66, one or more variables are missing.


Script 9.7: Missings.py
import wooldridge as woo
import pandas as pd

lawsch85 = woo.dataWoo('lawsch85')
lsat_pd = lawsch85['LSAT']

# create boolean indicator for missings:
missLSAT = lsat_pd.isna()

# LSAT and indicator for Schools No. 120-129:
preview = pd.DataFrame({'lsat_pd': lsat_pd[119:129],
                        'missLSAT': missLSAT[119:129]})
print(f'preview: \n{preview}\n')

# frequencies of indicator:
freq_missLSAT = pd.crosstab(missLSAT, columns='count')
print(f'freq_missLSAT: \n{freq_missLSAT}\n')

# missings for all variables in data frame (counts):
miss_all = lawsch85.isna()
colsums = miss_all.sum(axis=0)
print(f'colsums: \n{colsums}\n')

# computing amount of complete cases:
complete_cases = (miss_all.sum(axis=1) == 0)
freq_complete_cases = pd.crosstab(complete_cases, columns='count')
print(f'freq_complete_cases: \n{freq_complete_cases}\n')
Output of Script 9.7: Missings.py
preview:
     lsat_pd  missLSAT
119    156.0     False
120    159.0     False
121    157.0     False
122    167.0     False
123      NaN      True
124    158.0     False
125    155.0     False
126    157.0     False
127      NaN      True
128    163.0     False

freq_missLSAT:
col_0  count
LSAT
False    150
True       6

colsums:
rank       ...
salary     ...
cost       ...
LSAT         6
GPA        ...
libvol     ...
faculty    ...
age         45
clsize     ...
north      ...
south      ...
east       ...
west       ...
lsalary    ...
studfac    ...
top10      ...
r11_25     ...
r26_40     ...
r41_60     ...
llibvol    ...
lcost      ...
dtype: int64

freq_complete_cases:
col_0  count
row_0
False     66
True      90
The question of how to deal with missing values is not trivial and depends on many things. Modules
in Python offer different strategies. A very strict approach is used for numpy data types. For basic
functions such as numpy's mean function, the result is nan if at least one value of the
provided numpy array is missing. Instead, we have to use the function nanmean.

However, using the same mean function on pandas data types removes the observations with
missing values and does the calculations for the remaining ones.¹ This shows that you have to check
the behavior of each module in the presence of missing data to avoid errors.

The regression command ols removes missings by default and only reports the total
number of complete observations used in the regression (also available in the output of summary).
Script 9.8 (Missings-Analyses.py) gives examples of these features.
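The different conventions can be checked on a toy vector before turning to real data. The numbers below are made up for this sketch; the skipna argument of the pandas mean method switches off the default skipping of missing values.

```python
import numpy as np
import pandas as pd

values = [150.0, np.nan, 160.0, 170.0]   # toy data, not the law school data
x_np = np.array(values)
x_pd = pd.Series(values)

np_mean = np.mean(x_np)                   # numpy propagates the missing value
np_nanmean = np.nanmean(x_np)             # missing value is ignored
pd_mean = x_pd.mean()                     # pandas skips missing values by default
pd_mean_strict = x_pd.mean(skipna=False)  # strict behavior on request
print(np_mean, np_nanmean, pd_mean, pd_mean_strict)
```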


Script 9.8: Missings-Analyses.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

lawsch85 = woo.dataWoo('lawsch85')

# missings in numpy:
x_np = np.array(lawsch85['LSAT'])
x_np_bar1 = np.mean(x_np)
x_np_bar2 = np.nanmean(x_np)
print(f'x_np_bar1: {x_np_bar1}\n')
print(f'x_np_bar2: {x_np_bar2}\n')

# missings in pandas:
x_pd = lawsch85['LSAT']
x_pd_bar1 = np.mean(x_pd)
x_pd_bar2 = np.nanmean(x_pd)
print(f'x_pd_bar1: {x_pd_bar1}\n')
print(f'x_pd_bar2: {x_pd_bar2}\n')

# observations and variables:
print(f'lawsch85.shape: {lawsch85.shape}\n')

# regression (missings are taken care of by default):
reg = smf.ols(formula='np.log(salary) ~ LSAT + cost + age', data=lawsch85)
results = reg.fit()
print(f'results.nobs: {results.nobs}\n')


Output of Script 9.8: Missings-Analyses.py
x_np_bar1: nan

x_np_bar2: 158.29333333333332

x_pd_bar1: 158.29333333333332

x_pd_bar2: 158.29333333333332

lawsch85.shape: (156, 21)

results.nobs: 95.0

¹ This is also true for the mean method in pandas.
9.4. Outlying Observations

Wooldridge (2019, Section 9.5) offers a very useful discussion of outlying observations. One
of the important messages from the discussion is that dealing with outliers is a tricky business.
The module statsmodels offers a method get_influence() to automatically calculate
all studentized residuals discussed there. These residuals become available under the attribute
resid_studentized_external in the resulting object. For the R&D example from Wooldridge
(2019), Script 9.9 (Outliers.py) calculates them and reports the highest and the lowest number. It
also generates the histogram with overlaid density plot in Figure 9.1. Especially the highest value
of 4.55 appears to be an extremely outlying value.
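To see what such a studentized residual measures, it can be computed by hand. The sketch below uses simulated data with one deliberately contaminated observation (all numbers are assumptions for this illustration) and relies on two standard results: the leave-one-out error variance can be obtained from the full-sample residuals and leverages, and the resulting externally studentized residual equals the t statistic of a dummy variable for that observation.

```python
import numpy as np

# simulated data (an assumption for illustration), with one contaminated point
rng = np.random.default_rng(42)
n = 32
X = np.column_stack((np.ones(n), rng.normal(size=n), rng.normal(size=n)))
y = X @ np.array([2.0, 0.5, 0.3]) + rng.normal(size=n)
y[0] += 10.0                                  # make observation 0 an outlier

p = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages (diagonal of hat matrix)
resid = y - X @ (XtX_inv @ X.T @ y)
ssr = resid @ resid

# error variance estimated without observation i
s2_loo = (ssr - resid ** 2 / (1 - h)) / (n - p - 1)
studres = resid / np.sqrt(s2_loo * (1 - h))

# cross-check for observation 0: t statistic of a dummy for that observation
d0 = np.zeros(n)
d0[0] = 1.0
Xd = np.column_stack((X, d0))
bd, *_ = np.linalg.lstsq(Xd, y, rcond=None)
ed = y - Xd @ bd
se_d = np.sqrt(ed @ ed / (n - p - 1) * np.linalg.inv(Xd.T @ Xd)[-1, -1])
t_dummy = bd[-1] / se_d
print(f'studres[0]: {studres[0]}, t_dummy: {t_dummy}')
```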


Script 9.9: Outliers.py
import wooldridge as woo
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

rdchem = woo.dataWoo('rdchem')

# OLS regression:
reg = smf.ols(formula='rdintens ~ sales + profmarg', data=rdchem)
results = reg.fit()

# studentized residuals for all observations:
studres = results.get_influence().resid_studentized_external

# display extreme values:
studres_max = np.max(studres)
studres_min = np.min(studres)
print(f'studres_max: {studres_max}\n')
print(f'studres_min: {studres_min}\n')

# histogram (and overlaid density plot):
kde = sm.nonparametric.KDEUnivariate(studres)
kde.fit()

plt.hist(studres, color='grey', density=True)
plt.plot(kde.support, kde.density, color='black', linewidth=2)
plt.ylabel('density')
plt.xlabel('studres')
plt.savefig('PyGraphs/Outliers.pdf')


Output of Script 9.9: Outliers.py
studres_max: 4.555033421514247

studres_min: -1.8180393952811718

Figure 9.1. Outliers: Distribution of Studentized Residuals

9.5. Least Absolute Deviations (LAD) Estimation


As an alternative to OLS, the least absolute deviations (LAD) estimator is less sensitive to outliers. 
Instead of minimizing the sum of squared residuals, it minimizes the sum of the absolute values of the 
residuals. 
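The difference between the two criteria is easiest to see in a model with only a constant. In the following sketch (a toy sample with one outlier, chosen for this illustration), a grid search shows that the sum of squared deviations is minimized at the sample mean, whereas the sum of absolute deviations is minimized at the sample median, which hardly reacts to the outlier.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])    # toy sample with one outlier
grid = np.linspace(0, 100, 10001)             # candidate values in steps of 0.01

# objective functions evaluated for every candidate value
sum_sq = ((y[:, None] - grid[None, :]) ** 2).sum(axis=0)
sum_abs = np.abs(y[:, None] - grid[None, :]).sum(axis=0)

best_sq = grid[np.argmin(sum_sq)]    # minimizer of squared deviations
best_abs = grid[np.argmin(sum_abs)]  # minimizer of absolute deviations
print(f'best_sq: {best_sq}, mean: {np.mean(y)}')
print(f'best_abs: {best_abs}, median: {np.median(y)}')
```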

Wooldridge (2019, Section 9.6) explains that the LAD estimator attempts to estimate the parameters
of the conditional median $\mathrm{Med}(y|x_1, \ldots, x_k)$ instead of the conditional mean $\mathrm{E}(y|x_1, \ldots, x_k)$. This
makes LAD a special case of quantile regression which studies general quantiles of which the median
(= 0.5 quantile) is just a special case. In statsmodels, general quantile regression (and LAD as the
special case) can easily be implemented with the command quantreg. It works very similarly to ols
for OLS estimation.

Script 9.10 (LAD.py) demonstrates its application using the example from Wooldridge (2019, Example
9.8) and Script 9.9. Note that LAD inferences are only valid asymptotically, so the results in
this example with n = 32 should be taken with a grain of salt.


Script 9.10: LAD.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

rdchem = woo.dataWoo('rdchem')

# OLS regression:
reg_ols = smf.ols(formula='rdintens ~ I(sales/1000) + profmarg', data=rdchem)
results_ols = reg_ols.fit()

table_ols = pd.DataFrame({'b': round(results_ols.params, 4),
                          'se': round(results_ols.bse, 4),
                          't': round(results_ols.tvalues, 4),
                          'pval': round(results_ols.pvalues, 4)})
print(f'table_ols: \n{table_ols}\n')

# LAD regression:
reg_lad = smf.quantreg(formula='rdintens ~ I(sales/1000) + profmarg', data=rdchem)
results_lad = reg_lad.fit(q=.5)

table_lad = pd.DataFrame({'b': round(results_lad.params, 4),
                          'se': round(results_lad.bse, 4),
                          't': round(results_lad.tvalues, 4),
                          'pval': round(results_lad.pvalues, 4)})
print(f'table_lad: \n{table_lad}\n')


Output of Script 9.10: LAD.py
table_ols:
                      b      se       t    pval
Intercept        2.6253  0.5855  4.4835  0.0001
I(sales / 1000)  0.0534  0.0441  1.2111  0.2356
profmarg         0.0446  0.0462  0.9661  0.3420

table_lad:
                      b      se       t    pval
Intercept        1.6231  0.7012  2.3148  0.0279
I(sales / 1000)  0.0186  0.0528  0.3529  0.7267
profmarg         0.1179  0.0553  2.1320  0.0416
Regression Analysis with Time Series Data

10. Basic Regression Analysis with Time Series Data

Time series differ from cross-sectional data in that each observation (i.e. row in a data frame)
corresponds to one point or period in time. Section 10.1 introduces the most basic static time series
models. In Section 10.2, we look into more technical details of how to deal with time series data
in Python. Other aspects of time series models such as dynamics, trends, and seasonal effects are
treated in Section 10.3.


10.1. Static Time Series Models 


Static time series regression models describe the contemporaneous relation between the dependent
variable $y$ and the regressors $z_1, \ldots, z_k$. For each observation $t = 1, \ldots, n$, a static equation has the
form

$$y_t = \beta_0 + \beta_1 z_{1t} + \cdots + \beta_k z_{kt} + u_t. \qquad (10.1)$$


For the estimation of these models, the fact that we have time series does not make any practical 
difference. We can still use ols from statsmodels to estimate the parameters and the other 
tools for statistical inference. We only have to be aware that the assumptions needed for unbiased 
estimation and valid inference differ somewhat. Important differences to cross-sectional data are that 
we have to assume strict exogeneity (Assumption TS.3) for unbiasedness and no serial correlation 
(Assumption TS.5) for the usual variance-covariance formula to be valid, see Wooldridge (2019, 
Section 10.3). 


Wooldridge, Example 10.2: Effects of Inflation and Deficits on Interest Rates 


The data set INTDEF contains yearly information on interest rates and related time series between 1948
and 2003. Script 10.1 (Example-10-2.py) estimates a static model explaining the interest rate i3 with
the inflation rate inf and the federal budget deficit def. There is nothing different in the implementation
than for cross-sectional data. Both regressors are found to have a statistically significant relation to the
interest rate.

The example also demonstrates a practical problem: the variable names inf and def correspond 
to Python keywords that have a predefined meaning and syntax. Because we are interested in the 
variable and not in keywords, we have to use the Q function within the formula. 
Script 10.1: Example-10-2.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

intdef = woo.dataWoo('intdef')

# linear regression of static model (Q function avoids conflicts with keywords):
reg = smf.ols(formula='i3 ~ Q("inf") + Q("def")', data=intdef)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Output of Script 10.1: Example-10-2.py 
table: 
b se t pval 
Intercept 1.7333 0.4320 4.0125 0.0002 
Q("inf") 0.6059 0.0821 7.3765 0.0000 
Q("def") 0.5131 0.1184 4.3338 0.0001 


10.2. Time Series Data Types in Python 


For calculations specific to time series such as lags, trends, and seasonal effects, we will have to
explicitly define the structure of our data. We will use pandas variable types specific to time series
data. The most important distinction is whether or not the data are equispaced. The observations
of equispaced time series are collected at regular points in time. Typical examples are monthly,
quarterly, or yearly data.

Observations of irregular time series have varying distances. An important example is daily
financial data which are unavailable on weekends and bank holidays. Another example is financial
tick data which contain a record each time a trade is completed, which obviously does not happen
at regular points in time. Although we will mostly work with equispaced data, we will briefly
introduce these types in Section 10.2.2.


10.2.1. Equispaced Time Series in Python 


A convenient way to deal with equispaced time series in pandas is to store them as a data frame
(i.e. the type DataFrame). To capture the time dimension, you assign an appropriate index. With
equispaced time series this is especially convenient in pandas with the function date_range. It
has the four important arguments start, end, periods and freq that describe the time structure
of the data:
* start / end: Left/right bound of first/last observation is accepted in different formats. All
  examples create the same starting/ending bound:
  - start='1978-02'
  - start='1978-02-01'
  - start='02/01/1978'
  - start='2/1/1978'
* periods: Number of equispaced points in time you need to generate.
* freq: Number of observations per time unit. Examples:
  - freq='Y': Yearly data (at the end of a year)
  - freq='QS': Quarterly data (at the beginning of a quarter)
  - freq='M': Monthly data (at the end of a month)

Because the data are equispaced, you have to specify three arguments and the remaining one is
implied. Obviously, this procedure only works if two consecutive rows represent two consecutive
points in time in an ascending order.

As an example, consider the data set named BARIUM. It contains monthly data on imports of
barium chloride from China between February 1978 and December 1988. Wooldridge (2019, Example
10.5) explains the data and background. Script 10.2 (Example-Barium.py) demonstrates the use
of date_range and how Figure 10.1 was generated. The time axis is automatically formatted
appropriately.
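The statement that three of the four arguments imply the fourth can be illustrated directly. The sketch below generates the same monthly index twice, once from start, periods and freq, and once from start, end and freq. It uses the alias 'MS' (beginning of a month), which is an assumption of this sketch chosen because the spelling of the month-end alias differs between pandas versions.

```python
import pandas as pd

# start, periods and freq given: the end is implied
idx_periods = pd.date_range(start='1978-02', periods=131, freq='MS')

# start, end and freq given: the number of periods is implied
idx_end = pd.date_range(start='1978-02-01', end='1988-12-01', freq='MS')

print(idx_periods[[0, -1]])
print(len(idx_end))
```

The 131 periods correspond to the months from February 1978 through December 1988, the time span of the BARIUM data.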


Script 10.2: Example-Barium.py
import wooldridge as woo
import pandas as pd
import matplotlib.pyplot as plt

barium = woo.dataWoo('barium')
T = len(barium)

# monthly time series starting Feb. 1978:
barium.index = pd.date_range(start='1978-02', periods=T, freq='M')
print(f'barium["chnimp"].head(): \n{barium["chnimp"].head()}\n')

# plot chnimp (default: index on the x-axis):
plt.plot('chnimp', data=barium, color='black', linestyle='-')
plt.ylabel('chnimp')
plt.xlabel('time')
plt.savefig('PyGraphs/Example-Barium.pdf')


Output of Script 10.2: Example-Barium.py
barium["chnimp"].head():
1978-02-28    220.462006
1978-03-31     94.791997
1978-04-30    219.357498
1978-05-31    317.421509
1978-06-30    114.639000
Freq: M, Name: chnimp, dtype: float64

Figure 10.1. Time Series Plot: Imports of Barium Chloride from China

10.2.2. Irregular Time Series in Python


For the remainder of this book, we will work with equispaced time series. But since irregular time 
series are important for example in finance, we will briefly introduce them here. The only thing 
changing is that you cannot use date_range to generate time stamps. Instead, these are provided 
in your data and you can assign them to the index of your pandas data frame. 
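A minimal sketch of this step, assuming the time stamps arrive as strings in a column called date (the column names and the small data frame are made up for this illustration): the column is converted with to_datetime and assigned as the index, and the distance between consecutive observations reveals the weekend gap.

```python
import pandas as pd

# hypothetical daily prices: no records for the weekend of Jan 4-5, 2014
raw = pd.DataFrame({'date': ['2014-01-02', '2014-01-03', '2014-01-06', '2014-01-07'],
                    'close': [15.44, 15.51, 15.58, 15.38]})

# use the time stamps provided in the data as the index
prices = raw.set_index(pd.to_datetime(raw['date']))['close']

# distance between consecutive observations in days
gaps = prices.index.to_series().diff().dt.days
print(gaps)
```

Unlike an index created by date_range, this index carries no frequency information, which reflects the irregular spacing.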

Daily financial data sets are important examples of irregular time series. Because of weekends and
bank holidays, these data are not equispaced and each data point contains a time stamp, usually
the date. To demonstrate this, we will briefly look at the module pandas_datareader introduced
in Section 1.3.3. It can automatically download financial data from Yahoo Finance and other sources.
In order to do so, we must know the ticker symbol of the stock or whatever we are interested in. It
can be looked up at https://finance.yahoo.com/lookup.

For example, the symbol for the Dow Jones Industrial Average is ^DJI, Apple stocks have 
the symbol AAPL and the Ford Motor Company is simply abbreviated as F. Script 10.3 
(Example-StockData.py) demonstrates the import and the format of the imported data. 
They include information on opening, closing, high, and low prices as well as the trading volume 
and the adjusted (for events like stock splits and dividend payments) closing prices. We also print 
the first and last 5 rows of data, and plot the adjusted closing prices over time. 


Script 10.3: Example-StockData.py
import pandas_datareader as pdr
import matplotlib.pyplot as plt

# download data for 'F' (= Ford Motor Company) and define start and end:
tickers = ['F']
start_date = '2014-01-01'
end_date = '2015-12-31'

# use pandas_datareader for the import:
F_data = pdr.data.DataReader(tickers, 'yahoo', start_date, end_date)

# look at imported data:
print(f'F_data.head(): \n{F_data.head()}\n')
print(f'F_data.tail(): \n{F_data.tail()}\n')

# time series plot of adjusted closing prices:
plt.plot('Close', data=F_data, color='black', linestyle='-')
plt.ylabel('Ford Close Price')
plt.xlabel('time')
plt.savefig('PyGraphs/Example-StockData.pdf')
Figure 10.2. Time Series Plot: Stock Prices of Ford Motor Company

Output of Script 10.3: Example-StockData.py
F_data.head():
Attributes  Adj Close  Close   High    Low   Open      Volume
Symbols             F      F      F      F      F           F
Date
2014-01-02  11.349146  15.44  15.45  15.28  15.42  31528500.0
2014-01-03  11.400599  15.51  15.64  15.30  15.52  46122300.0
2014-01-06  11.452051  15.58  15.76  15.52  15.72  42657600.0
2014-01-07  11.305044  15.38  15.74  15.35  15.73  54476300.0
2014-01-08  11.422651  15.54  15.71  15.51  15.60  48448300.0

F_data.tail():
Attributes  Adj Close  Close   High    Low   Open      Volume
Symbols             F      F      F      F      F           F
Date
2015-12-24  11.299311  14.31  14.37  14.25  14.35   9000100.0
2015-12-28  11.196661  14.18  14.34  14.16  14.28  13697500.0
2015-12-29  11.236141  14.23  14.30  14.15  14.28  18867800.0
2015-12-30  11.188764  14.17  14.26  14.12  14.23  13800300.0
2015-12-31  11.125596  14.09  14.16  14.04  14.14  19881000.0
10.3. Other Time Series Models

10.3.1. Finite Distributed Lag Models

Finite distributed lag (FDL) models allow past values of regressors to affect the dependent variable.
An FDL model of order q with an independent variable z can be written as

$$y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \cdots + \delta_q z_{t-q} + u_t. \qquad (10.2)$$

Wooldridge (2019, Section 10.2) discusses the specification and interpretation of such models. For the
implementation, we generate the q additional variables that reflect the lagged values $z_{t-1}, \ldots, z_{t-q}$
and include them in the model formula of ols. The method shift(k) generates the
lagged variable $z_{t-k}$. Be aware that this only works if rows are sorted in an ascending order by the
time variable. If your data frame df looks different and time is the time variable, you have to run
df = df.sort_values(by=['time']) first.
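A small sketch with made-up numbers (the data frame, its column names, and the lag names z_lag1 and z_lag2 are assumptions for this illustration) shows the sorting step and the effect of shift:

```python
import pandas as pd

# hypothetical yearly data that arrive in the wrong order
df = pd.DataFrame({'time': [1915, 1913, 1914, 1916],
                   'z': [30.0, 10.0, 20.0, 40.0]})

# sort by the time variable first; sort_values returns a new data frame
df = df.sort_values(by=['time']).reset_index(drop=True)

# lagged values of z; the first k rows have no predecessor and become nan
df['z_lag1'] = df['z'].shift(1)
df['z_lag2'] = df['z'].shift(2)
print(df)
```

Since shift works purely on the row order, applying it to the unsorted data would silently pair each year with the wrong predecessor.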


Wooldridge, Example 10.4: Effects of Personal Exemption on Fertility Rates 


The data set FERTIL3 contains yearly information on the general fertility rate gfr and the personal tax
exemption pe for the years 1913 through 1984. Dummy variables for the second world war ww2 and
the availability of the birth control pill pill are also included. Script 10.4 (Example-10-4.py) shows
the distributed lag model including contemporaneous pe and two lags. All pe coefficients are insignificantly
different from zero according to the respective t tests. In Script 10.5 (Example-10-4-cont.py) a
usual F test implemented with f_test reveals that they are jointly significantly different from zero at a
significance level of $\alpha = 5\%$ with a p value of 0.012 (see ftest1). As Wooldridge (2019) discusses, this
points to a multicollinearity problem.


Script 10.4: Example-10-4.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

fertil3 = woo.dataWoo('fertil3')
T = len(fertil3)

# define yearly time series beginning in 1913:
fertil3.index = pd.date_range(start='1913', periods=T, freq='Y').year

# add all lags of 'pe' up to order 2:
fertil3['pe_lag1'] = fertil3['pe'].shift(1)
fertil3['pe_lag2'] = fertil3['pe'].shift(2)

# linear regression of model with lags:
reg = smf.ols(formula='gfr ~ pe + pe_lag1 + pe_lag2 + ww2 + pill', data=fertil3)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')
Output of Script 10.4: Example-10-4.py
table:
                 b       se        t    pval
Intercept  95.8705   3.2820  29.2114  0.0000
pe          0.0727   0.1255   0.5789  0.5647
pe_lag1    -0.0058   0.1557  -0.0371  0.9705
pe_lag2     0.0338   0.1263   0.2679  0.7896
ww2       -22.1265  10.7320  -2.0617  0.0433
pill      -31.3050   3.9816  -7.8625  0.0000


The long-run propensity (LRP) of FDL models measures the cumulative effect of a change in the
independent variable z on the dependent variable y over time and is simply equal to the sum of the
respective parameters

$$\mathrm{LRP} = \delta_0 + \delta_1 + \cdots + \delta_q.$$

We can calculate it directly from the estimated regression model. For testing whether it is different
from zero, we can again use the convenient f_test command.
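Why a single linear restriction is enough to test the LRP can be seen from a standard reparametrization, sketched here for q = 2. The symbol $\theta_0$ is introduced only for this illustration:

```latex
\begin{align*}
\theta_0 &= \delta_0 + \delta_1 + \delta_2
  \quad\Longrightarrow\quad \delta_0 = \theta_0 - \delta_1 - \delta_2 \\
y_t &= \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + u_t \\
    &= \alpha_0 + \theta_0 z_t + \delta_1 (z_{t-1} - z_t) + \delta_2 (z_{t-2} - z_t) + u_t
\end{align*}
```

In the rewritten model, the coefficient of $z_t$ is the LRP itself, so its usual t statistic tests the null hypothesis LRP = 0. With only one restriction, the F statistic is the square of this t statistic.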


Wooldridge, Example 10.4: (continued) 


Script 10.5 (Example-10-4-cont.py) calculates the estimated LRP to be around 0.1. According to an
F test, it is significantly different from zero with a p value of around 0.001.


Script 10.5: Example-10-4-cont.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

fertil3 = woo.dataWoo('fertil3')
T = len(fertil3)

# define yearly time series beginning in 1913:
fertil3.index = pd.date_range(start='1913', periods=T, freq='Y').year

# add all lags of 'pe' up to order 2:
fertil3['pe_lag1'] = fertil3['pe'].shift(1)
fertil3['pe_lag2'] = fertil3['pe'].shift(2)

# linear regression of model with lags:
reg = smf.ols(formula='gfr ~ pe + pe_lag1 + pe_lag2 + ww2 + pill', data=fertil3)
results = reg.fit()

# F test (H0: all pe coefficients are=0):
hypotheses1 = ['pe = 0', 'pe_lag1 = 0', 'pe_lag2 = 0']
ftest1 = results.f_test(hypotheses1)
fstat1 = ftest1.statistic[0][0]
fpval1 = ftest1.pvalue

print(f'fstat1: {fstat1}\n')
print(f'fpval1: {fpval1}\n')

# calculating the LRP:
b = results.params
b_pe_tot = b['pe'] + b['pe_lag1'] + b['pe_lag2']
print(f'b_pe_tot: {b_pe_tot}\n')

# F test (H0: LRP=0):
hypotheses2 = ['pe + pe_lag1 + pe_lag2 = 0']
ftest2 = results.f_test(hypotheses2)
fstat2 = ftest2.statistic[0][0]
fpval2 = ftest2.pvalue

print(f'fstat2: {fstat2}\n')
print(f'fpval2: {fpval2}\n')

Output of Script 10.5: Example-10-4-cont.py
fstat1: 3.9729640469785394

fpval1: 0.011652005303126536

b_pe_tot: 0.10071909027975469

fstat2: 11.421238467853682

fpval2: 0.0012408438602970466


10.3.2. Trends 


As pointed out by Wooldridge (2019, Section 10.5), deterministic linear (and exponential) time trends 
are accounted for by adding the time measure as another independent variable. 


Wooldridge, Example 10.7: Housing Investment and Prices 


The data set HSEINV provides annual observations on housing investments invpc and housing
prices price for the years 1947 through 1988. Using a double-logarithmic specification, Script 10.6
(Example-10-7.py) estimates a regression model with and without a linear trend. The variable t is
used to capture the time trend in the second regression. Forgetting to add the trend leads to the
spurious finding that investments and prices are related.

Because of the logarithmic dependent variable, the trend in invpc (as opposed to log(invpc)) is
exponential. The estimated coefficient implies a yearly increase in investments of about 1%.
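The link between the trend coefficient in a logarithmic model and the growth rate can be checked on synthetic data. The following sketch uses a made-up series that grows by exactly 1% per period. It is not the HSEINV data, and np.polyfit merely stands in for the ols command:

```python
import numpy as np

# hypothetical series growing by exactly 1% per period (no noise):
t = np.arange(1, 43)            # 42 periods, mirroring 1947-1988
y = 100.0 * 1.01 ** t

# OLS of log(y) on t; for degree 1, polyfit returns (slope, intercept):
slope, intercept = np.polyfit(t, np.log(y), deg=1)

# exact growth rate implied by the slope:
growth = np.exp(slope) - 1

print(f'slope: {slope}')
print(f'implied growth rate: {growth}')
```

The slope equals log(1.01), which is close to but slightly below 0.01. For small coefficients, reading the slope directly as a growth rate is therefore a good approximation.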


Script 10.6: Example-10-7.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

hseinv = woo.dataWoo('hseinv')

# linear regression without time trend:
reg_wot = smf.ols(formula='np.log(invpc) ~ np.log(price)', data=hseinv)
results_wot = reg_wot.fit()

# print regression table:
table_wot = pd.DataFrame({'b': round(results_wot.params, 4),
                          'se': round(results_wot.bse, 4),
                          't': round(results_wot.tvalues, 4),
                          'pval': round(results_wot.pvalues, 4)})
print(f'table_wot: \n{table_wot}\n')

# linear regression with time trend (data set includes a time variable t):
reg_wt = smf.ols(formula='np.log(invpc) ~ np.log(price) + t', data=hseinv)
results_wt = reg_wt.fit()

# print regression table:
table_wt = pd.DataFrame({'b': round(results_wt.params, 4),
                         'se': round(results_wt.bse, 4),
                         't': round(results_wt.tvalues, 4),
                         'pval': round(results_wt.pvalues, 4)})
print(f'table_wt: \n{table_wt}\n')


Output of Script 10.6: Example-10-7.py
table_wot:
                    b      se        t    pval
Intercept     -0.5502  0.0430 -12.7882  0.0000
np.log(price)  1.2409  0.3824   3.2450  0.0024

table_wt:
                    b      se       t    pval
Intercept     -0.9131  0.1356 -6.7328  0.0000
np.log(price) -0.3810  0.6788 -0.5612  0.5779
t              0.0098  0.0035  2.7984  0.0079


10.3.3. Seasonality 


To account for seasonal effects, we add dummy variables for all but one (the reference) "season". So
with monthly data, we can include eleven dummies; see Chapter 7 for a detailed discussion.
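If a data set does not already contain such dummies, they can be generated from a time index. The following sketch assumes a made-up monthly data frame. The column prefix m and the choice of January as the reference month are arbitrary choices for this illustration:

```python
import pandas as pd

# hypothetical monthly index covering two years:
idx = pd.date_range(start='1978-01', periods=24, freq='MS')
df = pd.DataFrame({'y': range(24)}, index=idx)

# one dummy per month, dropping the first category (January) as reference:
month = pd.Series(idx.month, index=idx, name='month')
dummies = pd.get_dummies(month, prefix='m', drop_first=True, dtype=int)
df = df.join(dummies)

print(df.head(3))
```

The resulting columns m_2 to m_12 can be added to a model formula like any other regressor. Rows for January have zeros in all eleven dummies.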


Wooldridge, Example 10.11: Effects of Antidumping Filings 


The data in BARIUM were used in an antidumping case. They are monthly data on barium chloride 
imports from China between February 1978 and December 1988. Wooldridge (2019, Example 10.5) 
explains the data and background. When we estimate a model with monthly dummies, they do not
have significant coefficients except the dummy for April, which is marginally significant. An F test,
which is not reported, reveals no joint significance.
Script 10.7: Example-10-11.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

barium = woo.dataWoo('barium')

# linear regression with seasonal effects:
reg = smf.ols(formula='np.log(chnimp) ~ np.log(chempi) + np.log(gas) +'
                      'np.log(rtwex) + befile6 + affile6 + afdec6 +'
                      'feb + mar + apr + may + jun + jul +'
                      'aug + sep + oct + nov + dec',
              data=barium)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Output of Script 10.7: Example-10-11.py 
table: 

b se t pval 
Intercept 16.7792 32.4286 0.5174 0.6059 
np.log(chempi) 3.2651 0.4929 6.6238 0.0000 
np.log(gas) -1.2781 1.3890 -0.9202 0.3594
np.log(rtwex) 0.6630 0.4713 1.4068 0.1622 
befile6 0.1397 0.2668 0.5236 0.6016 
affile6 0.0126 0.2787 0.0453 0.9639 
afdec6 -0.5213 0.3019 -1.7264 0.0870 
feb -0.4177 0.3044 -1.3720 0.1728 
mar 0.0591 0.2647 0.2231 0.8239 
apr -0.4515 0.2684 -1.6822 0.0953 
may 0.0333 0.2692 0.1237 0.9018 
jun -0.2063 0.2693 -0.7663 0.4451 
jul 0.0038 0.2788 0.0138 0.9890 
aug -0.1571 0.2780 -0.5650 0.5732 
sep -0.1342 0.2677 -0.5012 0.6172 
oct 0.0517 0.2669 0.1937 0.8467 
nov -0.2463 0.2628 -0.9370 0.3508 
dec 0.1328 0.2714 0.4894 0.6255 


11. Further Issues in Using OLS with Time 
Series Data 


This chapter introduces important concepts for time series analyses. Section 11.1 discusses the gen- 
eral conditions under which asymptotic analyses work with time series data. An important require- 
ment will be that the time series exhibit weak dependence. In Section 11.2, we study highly persistent 
time series and present some simulation exercises. One solution to this problem is first differencing
as demonstrated in Section 11.3. How this can be done in the regression framework is the topic of 
Section 11.4. 


11.1. Asymptotics with Time Series 


As Wooldridge (2019, Section 11.2) discusses, asymptotic arguments also work with time series data 
under certain conditions. Importantly, we have to assume that the data are stationary and weakly 
dependent (Assumption TS.1). On the other hand, we can relax the strict exogeneity assumption TS.3 
and only have to assume contemporaneous exogeneity (Assumption TS.3’). Under the appropriate 
set of assumptions, we can use standard OLS estimation and inference. 


Wooldridge, Example 11.4: Efficient Markets Hypothesis 

The efficient markets hypothesis claims that we cannot predict stock returns from past returns. In a 
simple AR(1) model in which returns are regressed on lagged returns, this would imply a population 
slope coefficient of zero. The data set NYSE contains data on weekly stock returns. 

Script 11.1 (Example-11-4.py) shows the analyses. Regression 1 is the AR(1) model also discussed by
Wooldridge (2019). Models 2 and 3 add second and third lags to estimate higher-order AR(p) models.
In all models, no lagged value has a significant coefficient and also the F tests for joint significance (not
included in the script) do not reject the efficient markets hypothesis.
Script 11.1: Example-11-4.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

nyse = woo.dataWoo('nyse')
nyse['ret'] = nyse['return']

# add all lags up to order 3:
nyse['ret_lag1'] = nyse['ret'].shift(1)
nyse['ret_lag2'] = nyse['ret'].shift(2)
nyse['ret_lag3'] = nyse['ret'].shift(3)

# linear regression of model with lags:
reg1 = smf.ols(formula='ret ~ ret_lag1', data=nyse)
reg2 = smf.ols(formula='ret ~ ret_lag1 + ret_lag2', data=nyse)
reg3 = smf.ols(formula='ret ~ ret_lag1 + ret_lag2 + ret_lag3', data=nyse)
results1 = reg1.fit()
results2 = reg2.fit()
results3 = reg3.fit()

# print regression tables:
table1 = pd.DataFrame({'b': round(results1.params, 4),
                       'se': round(results1.bse, 4),
                       't': round(results1.tvalues, 4),
                       'pval': round(results1.pvalues, 4)})
print(f'table1: \n{table1}\n')

table2 = pd.DataFrame({'b': round(results2.params, 4),
                       'se': round(results2.bse, 4),
                       't': round(results2.tvalues, 4),
                       'pval': round(results2.pvalues, 4)})
print(f'table2: \n{table2}\n')

table3 = pd.DataFrame({'b': round(results3.params, 4),
                       'se': round(results3.bse, 4),
                       't': round(results3.tvalues, 4),
                       'pval': round(results3.pvalues, 4)})
print(f'table3: \n{table3}\n')


Output of Script 11.1: Example-11-4.py
table1:
                b      se       t    pval
Intercept  0.1796  0.0807  2.2248  0.0264
ret_lag1   0.0589  0.0380  1.5490  0.1218

table2:
                b      se       t    pval
Intercept  0.1857  0.0812  2.2889  0.0224
ret_lag1   0.0603  0.0382  1.5799  0.1146
ret_lag2  -0.0381  0.0381 -0.9982  0.3185

table3:
                b      se       t    pval
Intercept  0.1794  0.0816  2.1990  0.0282
ret_lag1   0.0614  0.0382  1.6056  0.1088
ret_lag2  -0.0403  0.0383 -1.0519  0.2932
ret_lag3   0.0307  0.0382  0.8038  0.4218
We can do a similar analysis for daily data. The module pandas_datareader introduced in
Section 1.3.3 allows us to directly download daily stock prices from Yahoo Finance. Script 11.2
(Example-EffMkts.py) downloads daily stock prices of Apple (ticker symbol AAPL) and stores
them as a DataFrame object. From the prices $p_t$, daily returns $r_t$ are calculated using the standard
formula

$$r_t = \log(p_t) - \log(p_{t-1}) \approx \frac{p_t - p_{t-1}}{p_{t-1}}.$$

Note that in the script, we calculate the difference using the method diff. It calculates the difference
from trading day to trading day, ignoring the fact that some of them are separated by weekends or
holidays. Obviously, this procedure only works if two consecutive rows represent two consecutive
points in time. Figure 11.1 plots the returns of the Apple stock. Even though we now have n = 2267
observations of daily returns, we cannot find any relation between current and past returns, which
supports (this version of) the efficient markets hypothesis.
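The return formula and the behavior of diff can be tried out without downloading anything. The prices and dates in this sketch are invented for illustration:

```python
import numpy as np
import pandas as pd

# hypothetical prices on consecutive trading days (note the weekend gap):
p = pd.Series([100.0, 101.0, 99.5, 100.2],
              index=pd.to_datetime(['2016-01-07', '2016-01-08',
                                    '2016-01-11', '2016-01-12']))

# log return: difference of log prices between consecutive rows
ret_log = np.log(p).diff()

# simple return for comparison: (p_t - p_{t-1}) / p_{t-1}
ret_simple = p.pct_change()

print(pd.DataFrame({'ret_log': ret_log, 'ret_simple': ret_simple}))
```

The first return is missing because there is no previous price. The return for Monday, 2016-01-11 is computed relative to the preceding Friday, since diff only looks at the previous row.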
Script 11.2: Example-EffMkts.py
import numpy as np
import pandas as pd
import pandas_datareader as pdr
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# download data for 'AAPL' (= Apple) and define start and end:
tickers = ['AAPL']
start_date = '2007-12-31'
end_date = '2016-12-31'

# use pandas_datareader for the import:
AAPL_data = pdr.data.DataReader(tickers, 'yahoo', start_date, end_date)

# drop ticker symbol from column name:
AAPL_data.columns = AAPL_data.columns.droplevel(level=1)

# calculate return as the log difference:
AAPL_data['ret'] = np.log(AAPL_data['Adj Close']).diff()

# time series plot of the log returns:
plt.plot('ret', data=AAPL_data, color='black', linestyle='-')
plt.ylabel('Apple Log Returns')
plt.xlabel('time')
plt.savefig('PyGraphs/Example-EffMkts.pdf')

# linear regression of models with lags:
AAPL_data['ret_lag1'] = AAPL_data['ret'].shift(1)
AAPL_data['ret_lag2'] = AAPL_data['ret'].shift(2)
AAPL_data['ret_lag3'] = AAPL_data['ret'].shift(3)

reg1 = smf.ols(formula='ret ~ ret_lag1', data=AAPL_data)
reg2 = smf.ols(formula='ret ~ ret_lag1 + ret_lag2', data=AAPL_data)
reg3 = smf.ols(formula='ret ~ ret_lag1 + ret_lag2 + ret_lag3', data=AAPL_data)
results1 = reg1.fit()
results2 = reg2.fit()
results3 = reg3.fit()

# print regression tables:
table1 = pd.DataFrame({'b': round(results1.params, 4),
                       'se': round(results1.bse, 4),
                       't': round(results1.tvalues, 4),
                       'pval': round(results1.pvalues, 4)})
print(f'table1: \n{table1}\n')

table2 = pd.DataFrame({'b': round(results2.params, 4),
                       'se': round(results2.bse, 4),
                       't': round(results2.tvalues, 4),
                       'pval': round(results2.pvalues, 4)})
print(f'table2: \n{table2}\n')

table3 = pd.DataFrame({'b': round(results3.params, 4),
                       'se': round(results3.bse, 4),
                       't': round(results3.tvalues, 4),
                       'pval': round(results3.pvalues, 4)})
print(f'table3: \n{table3}\n')
Output of Script 11.2: Example-EffMkts.py
table1:
                b      se       t    pval
Intercept  0.0007  0.0004  1.5667  0.1173
ret_lag1  -0.0034  0.0210 -0.1628  0.8707

table2:
                b      se       t    pval
Intercept  0.0007  0.0004  1.6107  0.1074
ret_lag1  -0.0035  0.0210 -0.1677  0.8668
ret_lag2  -0.0288  0.0210 -1.3722  0.1701

table3:
                b      se       t    pval
Intercept  0.0007  0.0004  1.6909  0.0910
ret_lag1  -0.0034  0.0210 -0.1618  0.8715
ret_lag2  -0.0303  0.0210 -1.4451  0.1486
ret_lag3   0.0054  0.0210  0.2569  0.7973


Figure 11.1. Time Series Plot: Daily Stock Returns 2008-2016, Apple Inc. 
11.2. The Nature of Highly Persistent Time Series


The simplest model for highly persistent time series is a random walk. It can be written as

\begin{align}
y_t &= y_{t-1} + e_t \tag{11.1} \\
    &= y_0 + e_1 + e_2 + \cdots + e_{t-1} + e_t \tag{11.2}
\end{align}

where the shocks $e_1, \ldots, e_t$ are i.i.d. with a zero mean. It is a special case of a unit root process.

Random walk processes are strongly dependent and nonstationary, violating Assumption TS.1’
required for the consistency of OLS parameter estimates. As Wooldridge (2019, Section 11.3) shows,
the variance of $y_t$ (conditional on $y_0$) increases linearly with t:

$$\mathrm{Var}(y_t \mid y_0) = \sigma_e^2 \cdot t. \tag{11.3}$$

This can be easily seen in a simulation exercise. Script 11.3 (Simulate-RandomWalk.py) draws
30 realizations from a random walk process with i.i.d. standard normal shocks $e_t$. After initializing
the random number generator, an empty figure with the right dimensions is produced. Then, the
realizations of the time series are drawn in a loop.¹ In each of the 30 draws, we first obtain a sample
of the n = 50 shocks $e_1, \ldots, e_{50}$. The random walk is generated as the cumulative sum of the shocks
according to Equation 11.2 with an initial value of $y_0 = 0$. The respective time series are then added
to the plot. In the resulting Figure 11.2, the increasing variance can be seen easily.


Script 11.3: Simulate-RandomWalk.py
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# set the random seed:
np.random.seed(1234567)

# initialize plot:
x_range = np.linspace(0, 50, num=51)
plt.ylim([-18, 18])
plt.xlim([0, 50])

# loop over draws:
for r in range(0, 30):
    # i.i.d. standard normal shock:
    e = stats.norm.rvs(0, 1, size=51)

    # set first entry to 0 (gives y_0 = 0):
    e[0] = 0

    # random walk as cumulative sum of shocks:
    y = np.cumsum(e)

    # add line to graph:
    plt.plot(x_range, y, color='lightgrey', linestyle='-')

plt.axhline(linewidth=2, linestyle='--', color='black')
plt.ylabel('y')
plt.xlabel('time')
plt.savefig('PyGraphs/Simulate-RandomWalk.pdf')
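The linear increase of the variance in Equation 11.3 can also be checked numerically instead of graphically. The following sketch is not one of the book's scripts. It assumes 20000 replications and uses the numpy random generator default_rng, both of which are choices made only for this illustration:

```python
import numpy as np

# many more replications than in the plot (assumption: 20000)
rng = np.random.default_rng(1234567)
n_rep, n = 20000, 50

# each row holds the shocks e_1, ..., e_50 of one realization:
e = rng.standard_normal(size=(n_rep, n))

# random walks with y_0 = 0: cumulative sums along the time axis
y = np.cumsum(e, axis=1)

# sample variance across realizations for every point in time:
v = np.var(y, axis=0)

for t in (10, 25, 50):
    print(f't = {t}: variance = {v[t - 1]}')
```

With standard normal shocks, the variance of the shocks is 1, so the sample variances should be close to t itself.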


¹ For a review of random number generation, see Section 1.6.4.

Figure 11.2. Simulations of a Random Walk Process


A simple generalization is a random walk with drift:

\begin{align}
y_t &= \alpha_0 + y_{t-1} + e_t \tag{11.4} \\
    &= y_0 + \alpha_0 \cdot t + e_1 + e_2 + \cdots + e_{t-1} + e_t \tag{11.5}
\end{align}

Script 11.4 (Simulate-RandomWalkDrift.py) simulates such a process with $\alpha_0 = 2$ and i.i.d.
standard normal shocks $e_t$. The resulting time series are plotted in Figure 11.3. The values fluctuate
around the expected value $\alpha_0 \cdot t$. But unlike weakly dependent processes, they do not tend towards
their mean, so the variance increases like for a simple random walk process.
Figure 11.3. Simulations of a Random Walk Process with Drift

Script 11.4: Simulate-RandomWalkDrift.py
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# set the random seed:
np.random.seed(1234567)

# initialize plot:
x_range = np.linspace(0, 50, num=51)
plt.ylim([0, 100])
plt.xlim([0, 50])

# loop over draws:
for r in range(0, 30):
    # i.i.d. standard normal shock:
    e = stats.norm.rvs(0, 1, size=51)

    # set first entry to 0 (gives y_0 = 0):
    e[0] = 0

    # random walk as cumulative sum of shocks plus drift:
    y = np.cumsum(e) + 2 * x_range

    # add line to graph:
    plt.plot(x_range, y, color='lightgrey', linestyle='-')

plt.plot(x_range, 2 * x_range, linewidth=2, linestyle='--', color='black')
plt.ylabel('y')
plt.xlabel('time')
plt.savefig('PyGraphs/Simulate-RandomWalkDrift.pdf')
An obvious question is whether a given sample is from a unit root process such as a random walk.
We will cover tests for unit roots in Section 18.2. 


11.3. Differences of Highly Persistent Time Series 


The simplest way to deal with highly persistent time series is to work with their differences rather
than their levels. The first difference of the random walk with drift is

\begin{align}
y_t &= \alpha_0 + y_{t-1} + e_t \tag{11.6} \\
\Delta y_t &= y_t - y_{t-1} = \alpha_0 + e_t \tag{11.7}
\end{align}

This is an i.i.d. process with mean $\alpha_0$. Script 11.5 (Simulate-RandomWalkDrift-Diff.py)
repeats the same simulation as Script 11.4 (Simulate-RandomWalkDrift.py) but calculates the
differences using y[1:51] - y[0:50]. From now on, we will use the more convenient method
diff for the same task. The resulting series are shown in Figure 11.4. They have a constant mean of
2, a constant variance of $\sigma_e^2 = 1$, and are independent over time.
Script 11.5: Simulate-RandomWalkDrift-Diff.py
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# set the random seed:
np.random.seed(1234567)

# initialize plot:
x_range = np.linspace(1, 50, num=50)
plt.ylim([-1, 5])
plt.xlim([0, 50])

# loop over draws:
for r in range(0, 30):
    # i.i.d. standard normal shock and cumulative sum of shocks:
    e = stats.norm.rvs(0, 1, size=51)
    e[0] = 0
    y = np.cumsum(2 + e)

    # first difference:
    Dy = y[1:51] - y[0:50]

    # add line to graph:
    plt.plot(x_range, Dy, color='lightgrey', linestyle='-')

plt.axhline(y=2, linewidth=2, linestyle='--', color='black')
plt.ylabel('y')
plt.xlabel('time')
plt.savefig('PyGraphs/Simulate-RandomWalkDrift-Diff.pdf')
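That slicing and the differencing functions give the same result can be verified with a few made-up numbers:

```python
import numpy as np
import pandas as pd

y = np.array([2.0, 3.5, 6.1, 8.0, 9.4])

# slicing as in the script: y_t - y_{t-1} for all but the first entry
Dy_slice = y[1:] - y[:-1]

# numpy equivalent:
Dy_np = np.diff(y)

# the pandas method diff keeps the length and puts NaN in the first row:
Dy_pd = pd.Series(y).diff()

print(Dy_slice)
print(Dy_pd)
```

The pandas version is convenient for regressions because the differenced series stays aligned with the other columns of the data frame.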


11.4. Regression with First Differences 


Adding first differences to regression models is straightforward. You have to add the dependent
or independent variable var as a first difference to your data before starting the usual ols
command. The same holds if you want to combine differences with lags in your specifications. This is
demonstrated in Example 11.6.

Figure 11.4. Simulations of a Random Walk Process with Drift: First Differences

As already mentioned, the methods shift and diff are helpful, but they require that consecutive
rows represent two consecutive points in time. These commands do not use any time stamp you may
have provided before.
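The following sketch shows what can go wrong if a period is missing. The yearly series is invented, and making the gap explicit with reindex is one possible remedy, shown here as an assumption rather than as the approach used in this book:

```python
import pandas as pd

# hypothetical yearly series in which the year 2003 is missing:
s = pd.Series([1.0, 2.0, 4.0, 7.0], index=[2001, 2002, 2004, 2005])

# diff works on neighboring rows: the value for 2004 is 4 - 2,
# i.e. a two-year change, although it looks like a first difference
d_rows = s.diff()

# making the gap explicit first turns the affected differences into NaN:
s_full = s.reindex(range(2001, 2006))
d_time = s_full.diff()

print(pd.DataFrame({'rows': d_rows, 'time': d_time}))
```

After reindexing, the differences that involve the missing year are missing as well and drop out of a subsequent regression instead of entering it silently with the wrong meaning.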


Wooldridge, Example 11.6: Fertility Equation 


We continue Example 10.4 and specify the fertility equation in first differences. Script 11.6 
(Example-11-6.py) shows the analyses. While the first difference of the tax exemptions has no 
significant effect, its second lag has a significantly positive coefficient in the second model. This is 
consistent with fertility reacting two years after a change of the tax code. 


Script 11.6: Example-11-6.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

fertil3 = woo.dataWoo('fertil3')
T = len(fertil3)

# define time series (years only) beginning in 1913:
fertil3.index = pd.date_range(start='1913', periods=T, freq='Y').year

# compute first difference:
fertil3['gfr_diff1'] = fertil3['gfr'].diff()
fertil3['pe_diff1'] = fertil3['pe'].diff()
print(f'fertil3.head(): \n{fertil3.head()}\n')

# linear regression of model with first differences:
reg1 = smf.ols(formula='gfr_diff1 ~ pe_diff1', data=fertil3)
results1 = reg1.fit()

# print regression table:
table1 = pd.DataFrame({'b': round(results1.params, 4),
                       'se': round(results1.bse, 4),
                       't': round(results1.tvalues, 4),
                       'pval': round(results1.pvalues, 4)})
print(f'table1: \n{table1}\n')

# linear regression of model with lagged differences:
fertil3['pe_diff1_lag1'] = fertil3['pe_diff1'].shift(1)
fertil3['pe_diff1_lag2'] = fertil3['pe_diff1'].shift(2)

reg2 = smf.ols(formula='gfr_diff1 ~ pe_diff1 + pe_diff1_lag1 + pe_diff1_lag2',
               data=fertil3)
results2 = reg2.fit()

# print regression table:
table2 = pd.DataFrame({'b': round(results2.params, 4),
                       'se': round(results2.bse, 4),
                       't': round(results2.tvalues, 4),
                       'pval': round(results2.pvalues, 4)})
print(f'table2: \n{table2}\n')


Output of Script 11.6: Example-11-6.py
fertil3.head():
             gfr     pe  year  t  ...  cgfr_4       gfr_2  gfr_diff1  pe_diff1
1913  124.699997   0.00  1913  1  ...     NaN         NaN        NaN       NaN
1914  126.599998   0.00  1914  2  ...     NaN         NaN   1.900002      0.00
1915  125.000000   0.00  1915  3  ...     NaN  124.699997  -1.599998      0.00
1916  123.400002   0.00  1916  4  ...     NaN  126.599998  -1.599998      0.00
1917  121.000000  19.27  1917  5  ...     NaN  125.000000  -2.400002     19.27

[5 rows x 26 columns]

table1:
                b      se       t    pval
Intercept -0.7848  0.5020 -1.5632  0.1226
pe_diff1  -0.0427  0.0284 -1.5045  0.1370

table2:
                    b      se       t    pval
Intercept     -0.9637  0.4678 -2.0602  0.0434
pe_diff1      -0.0362  0.0268 -1.3522  0.1810
pe_diff1_lag1 -0.0140  0.0276 -0.5070  0.6139
pe_diff1_lag2  0.1100  0.0269  4.0919  0.0001


12. Serial Correlation and 
Heteroscedasticity in Time Series 
Regressions 


In Chapter 8, we discussed the consequences of heteroscedasticity in cross-sectional regressions. In 
the time series setting, similar consequences and strategies apply to both heteroscedasticity (with 
some specific features) and serial correlation of the error term. Unbiasedness and consistency of the 
OLS estimators are unaffected. But the OLS estimators are inefficient and the usual standard errors 
and inferences are invalid. 

We first discuss how to test for serial correlation in Section 12.1. Section 12.2 introduces efficient 
estimation using feasible GLS estimators. As an alternative, we can still use OLS and calculate stan- 
dard errors that are valid under both heteroscedasticity and autocorrelation as discussed in Section 
12.3. Finally, Section 12.4 covers heteroscedasticity and autoregressive conditional heteroscedasticity 
(ARCH) models. 


12.1. Testing for Serial Correlation of the Error Term 
Suppose we are worried that the error terms $u_1, u_2, \ldots$ in a regression model of the form

$$y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \cdots + \beta_k x_{tk} + u_t \tag{12.1}$$

are serially correlated. A straightforward and intuitive testing approach is described by Wooldridge
(2019, Section 12.3). It is based on the fitted residuals
$\hat{u}_t = y_t - \hat{\beta}_0 - \hat{\beta}_1 x_{t1} - \cdots - \hat{\beta}_k x_{tk}$ which can
be obtained in statsmodels with the attribute resid, see Section 2.2.

To test for AR(1) serial correlation under strict exogeneity, we regress $\hat{u}_t$ on their lagged values
$\hat{u}_{t-1}$. If the regressors are not necessarily strictly exogenous, we can adjust the test by adding the
original regressors $x_{t1}, \ldots, x_{tk}$ to this regression. Then we perform the usual t test on the coefficient
of $\hat{u}_{t-1}$.

For testing for higher order serial correlation, we add higher order lags $\hat{u}_{t-2}, \hat{u}_{t-3}, \ldots$ as
explanatory variables and test the joint hypothesis that they are all equal to zero using either an F test or a
Lagrange multiplier (LM) test. Especially the latter version is often called Breusch-Godfrey test.
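The AR(1) version of this procedure can be sketched with plain numpy on simulated data. The data generating process, the value ρ = 0.6, and the helper function ols are assumptions made for this illustration only; the examples in this section use statsmodels instead:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# hypothetical data: y = 1 + 2 x + u with AR(1) errors (rho = 0.6)
x = rng.standard_normal(n)
eps = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + eps[t]
y = 1 + 2 * x + u


def ols(X, y):
    """Return OLS coefficients, standard errors and residuals."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return b, se, resid


# step 1: original regression and its residuals
X = np.column_stack([np.ones(n), x])
_, _, uhat = ols(X, y)

# step 2: regress uhat_t on uhat_{t-1} (and x_t, to allow for regressors
# that are not strictly exogenous); the first observation is lost
Z = np.column_stack([np.ones(n - 1), uhat[:-1], x[1:]])
b, se, _ = ols(Z, uhat[1:])

# step 3: usual t statistic for the coefficient of the lagged residual
t_stat = b[1] / se[1]
print(f'rho_hat: {b[1]}, t: {t_stat}')
```

Since the simulated errors are strongly autocorrelated, the t statistic should be far above conventional critical values.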
Wooldridge, Example 12.2: Testing for AR(1) Serial Correlation


We use this example to demonstrate the “pedestrian” way to test for autocorrelation which is actually
straightforward and instructive. We estimate two versions of the Phillips curve: a static model

$$\mathit{inf}_t = \beta_0 + \beta_1 \mathit{unem}_t + u_t$$

and an expectation-augmented Phillips curve

$$\Delta \mathit{inf}_t = \beta_0 + \beta_1 \mathit{unem}_t + u_t.$$

Scripts 12.1 (Example-12-2-Static.py) and 12.2 (Example-12-2-ExpAug.py) show the analyses.
After the estimation, the residuals are extracted with resid and regressed on their lagged values. We
report standard errors and t statistics. While there is strong evidence for autocorrelation in the static
equation with a t statistic of 4.93, the null hypothesis of no autocorrelation cannot be rejected in
the second model with a t statistic of -0.29.


Script 12.1: Example-12-2-Static.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

phillips = woo.dataWoo('phillips')
T = len(phillips)

# define yearly time series beginning in 1948:
date_range = pd.date_range(start='1948', periods=T, freq='Y')
phillips.index = date_range.year

# estimation of static Phillips curve:
yt96 = (phillips['year'] <= 1996)
reg_s = smf.ols(formula='Q("inf") ~ unem', data=phillips, subset=yt96)
results_s = reg_s.fit()

# residuals and AR(1) test:
phillips['resid_s'] = results_s.resid
phillips['resid_s_lag1'] = phillips['resid_s'].shift(1)
reg = smf.ols(formula='resid_s ~ resid_s_lag1', data=phillips, subset=yt96)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Output of Script 12.1: Example-12-2-Static.py
table:
                   b      se       t    pval
Intercept    -0.1134  0.3594 -0.3155  0.7538
resid_s_lag1  0.5730  0.1161  4.9337  0.0000
Script 12.2: Example-12-2-ExpAug.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

phillips = woo.dataWoo('phillips')
T = len(phillips)

# define yearly time series beginning in 1948:
date_range = pd.date_range(start='1948', periods=T, freq='Y')
phillips.index = date_range.year

# estimation of expectations-augmented Phillips curve:
yt96 = (phillips['year'] <= 1996)
phillips['inf_diff1'] = phillips['inf'].diff()
reg_ea = smf.ols(formula='inf_diff1 ~ unem', data=phillips, subset=yt96)
results_ea = reg_ea.fit()

# residuals and AR(1) test:
phillips['resid_ea'] = results_ea.resid
phillips['resid_ea_lag1'] = phillips['resid_ea'].shift(1)
reg = smf.ols(formula='resid_ea ~ resid_ea_lag1', data=phillips, subset=yt96)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Output of Script 12.2: Example-12-2-ExpAug.py
table:
                    b      se       t    pval
Intercept      0.1942  0.3004  0.6464  0.5213
resid_ea_lag1 -0.0356  0.1239 -0.2873  0.7752


This class of tests can also be performed automatically in statsmodels. Given the regression
results are stored in a variable results, the LM and F tests of AR(q) serial correlation can simply
be performed using

sm.stats.diagnostic.acorr_breusch_godfrey(results, nlags=q)


Wooldridge, Example 12.4: Testing for AR(3) Serial Correlation 


We already used the monthly data set BARIUM and estimated a model for barium chloride imports in
Example 10.11. Script 12.3 (Example-12-4.py) estimates the model and tests for AR(3) serial correlation
using the manual regression approach and the command acorr_breusch_godfrey. The manual
approach gives exactly the results reported by Wooldridge (2019) while the built-in command differs very
slightly because of a different implementation (for details, see the module documentation).
Script 12.3: Example-12-4.py
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

barium = woo.dataWoo('barium')
T = len(barium)

# monthly time series starting Feb. 1978:
barium.index = pd.date_range(start='1978-02', periods=T, freq='M')

reg = smf.ols(formula='np.log(chnimp) ~ np.log(chempi) + np.log(gas) +'
                      'np.log(rtwex) + befile6 + affile6 + afdec6',
              data=barium)
results = reg.fit()

# automatic test:
bg_result = sm.stats.diagnostic.acorr_breusch_godfrey(results, nlags=3)
fstat_auto = bg_result[2]
fpval_auto = bg_result[3]
print(f'fstat_auto: {fstat_auto}\n')
print(f'fpval_auto: {fpval_auto}\n')

# manual test:
barium['resid'] = results.resid
barium['resid_lag1'] = barium['resid'].shift(1)
barium['resid_lag2'] = barium['resid'].shift(2)
barium['resid_lag3'] = barium['resid'].shift(3)

reg_manual = smf.ols(formula='resid ~ resid_lag1 + resid_lag2 + resid_lag3 +'
                             'np.log(chempi) + np.log(gas) + np.log(rtwex) +'
                             'befile6 + affile6 + afdec6', data=barium)
results_manual = reg_manual.fit()

hypotheses = ['resid_lag1 = 0', 'resid_lag2 = 0', 'resid_lag3 = 0']
ftest_manual = results_manual.f_test(hypotheses)
fstat_manual = ftest_manual.statistic[0][0]
fpval_manual = ftest_manual.pvalue
print(f'fstat_manual: {fstat_manual}\n')
print(f'fpval_manual: {fpval_manual}\n')


Output of Script 12.3: Example-12-4.py
fstat_auto: 5.124662239772493

fpval_auto: 0.0022637197671316277

fstat_manual: 5.122907054069368

fpval_manual: 0.0022898028329663344


Another popular test is the Durbin-Watson test for AR(1) serial correlation. While the test
statistic is pretty straightforward to compute, its distribution is non-standard and depends on the data.
statsmodels includes the test statistic in the output of the summary command or offers the
command durbin_watson. The test statistic ranges from 0 to 4, where 2 represents the case of no serial
correlation. A value towards 0 indicates positive serial correlation, a value towards 4 negative serial
correlation. Given the CLM assumptions, p values can be calculated but they are not included in the
output of this function. Instead we use the critical values reported in Wooldridge (2019) to perform
the hypothesis tests.
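To illustrate how the statistic is computed and why values around 2 indicate no serial correlation, the following sketch implements the textbook formula directly. The function name durbin_watson_stat and the simulated residuals with ρ = 0.7 are assumptions for this illustration and not the statsmodels implementation:

```python
import numpy as np


def durbin_watson_stat(resid):
    """DW = sum_{t=2}^{n} (u_t - u_{t-1})^2 / sum_{t=1}^{n} u_t^2."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)


# hypothetical residual series with positive AR(1) correlation (rho = 0.7)
rng = np.random.default_rng(0)
n = 2000
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.standard_normal()

dw = durbin_watson_stat(u)

# first-order sample autocorrelation for the approximation DW ~ 2 (1 - rho)
rho_hat = np.sum(u[1:] * u[:-1]) / np.sum(u ** 2)

print(f'DW: {dw}')
print(f'2 * (1 - rho_hat): {2 * (1 - rho_hat)}')
```

The two printed numbers nearly coincide, which shows that DW is approximately 2(1 − ρ̂): positive autocorrelation pushes the statistic towards 0, negative autocorrelation towards 4.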

Script 12.4 (Example-DWtest.py) repeats Example 12.2 but conducts DW tests instead of the t
tests. The conclusions are the same: For the static model, the null hypothesis of no serial correlation
can be rejected at a 1% level with a test statistic of DW = 0.8027, because it is below the critical
value of $d_L = 1.32$. For the expectation augmented Phillips curve, the null hypothesis cannot be
rejected at a 5% level because DW = 1.7696 is greater than $d_U = 1.59$.


Script 12.4: Example-DWtest.py
import wooldridge as woo
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

phillips = woo.dataWoo('phillips')
T = len(phillips)

# define yearly time series beginning in 1948:
date_range = pd.date_range(start='1948', periods=T, freq='Y')
phillips.index = date_range.year

# estimation of both Phillips curve models:
yt96 = (phillips['year'] <= 1996)
phillips['inf_diff1'] = phillips['inf'].diff()
reg_s = smf.ols(formula='Q("inf") ~ unem', data=phillips, subset=yt96)
reg_ea = smf.ols(formula='inf_diff1 ~ unem', data=phillips, subset=yt96)
results_s = reg_s.fit()
results_ea = reg_ea.fit()

# DW tests:
DW_s = sm.stats.stattools.durbin_watson(results_s.resid)
DW_ea = sm.stats.stattools.durbin_watson(results_ea.resid)
print(f'DW_s: {DW_s}\n')
print(f'DW_ea: {DW_ea}\n')


Output of Script 12.4: Example-DWtest.py
DW_s: 0.802700467848626

DW_ea: 1.7696478574549568


12.2. FGLS Estimation


There are several ways to implement the FGLS methods for serially correlated error terms in Python.
A simple way is provided by the module statsmodels with its command GLSAR. It expects matrices
of dependent and independent variables and reports the Cochrane-Orcutt estimator as
demonstrated in Example 12.5.
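
The idea behind the iterative Cochrane-Orcutt procedure can be sketched with plain numpy on simulated data: estimate the model, regress the residuals on their first lag to get an estimate of the AR(1) parameter, quasi-difference all variables, and repeat until the parameter settles. This is a minimal sketch under stated assumptions; the function names ols and cochrane_orcutt and the simulated data are hypothetical, and the internals of GLSAR may differ in detail.

```python
import numpy as np


def ols(y, X):
    """OLS coefficients for y = X b + u."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def cochrane_orcutt(y, X, maxiter=100, tol=1e-8):
    """Iterative Cochrane-Orcutt FGLS; X must contain a column of ones."""
    rho = 0.0
    b = ols(y, X)
    for _ in range(maxiter):
        u = y - X @ b                                    # residuals of the level equation
        rho_new = (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])   # regress u_t on u_{t-1}
        y_star = y[1:] - rho_new * y[:-1]                # quasi-differencing drops t = 1
        X_star = X[1:] - rho_new * X[:-1]
        b = ols(y_star, X_star)
        converged = abs(rho_new - rho) < tol
        rho = rho_new
        if converged:
            break
    return b, rho


# simulated data: y = 1 + 2 x + u with AR(1) errors (rho = 0.5)
rng = np.random.default_rng(123)
n = 2000
x = rng.standard_normal(n)
e = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + e[t]
y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(n), x])

b_hat, rho_hat = cochrane_orcutt(y, X)
print(f'b_hat: {b_hat}\n')
print(f'rho_hat: {rho_hat}\n')
```

Because the column of ones is quasi-differenced along with the other regressors, the first element of b_hat remains an estimate of the intercept of the original equation.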


Wooldridge, Example 12.5: Cochrane-Orcutt Estimation 

We once again use the monthly data set BARIUM and the same model as before. Script 12.5 
(Example-12-5.py) estimates the model with OLS and then calls GLSAR. As expected, the results are
very close to the Prais-Winsten estimates reported by Wooldridge (2019). 


Script 12.5: Example-12-5.py
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.api as sm
import patsy as pt

barium = woo.dataWoo('barium')
T = len(barium)

# monthly time series starting Feb. 1978:
barium.index = pd.date_range(start='1978-02', periods=T, freq='M')

# perform the Cochrane-Orcutt estimation (iterative procedure):
y, X = pt.dmatrices('np.log(chnimp) ~ np.log(chempi) + np.log(gas) +'
                    'np.log(rtwex) + befile6 + affile6 + afdec6',
                    data=barium, return_type='dataframe')
reg = sm.GLSAR(y, X)
CORC_results = reg.iterative_fit(maxiter=100)
table = pd.DataFrame({'b_CORC': CORC_results.params,
                      'se_CORC': CORC_results.bse})
print(f'reg.rho: {reg.rho}\n')
print(f'table: \n{table}\n')


Output of Script 12.5: Example-12-5.py
reg.rho: [0.29585313]

table:
                   b_CORC    se_CORC
Intercept      -37.512978  23.239015
np.log(chempi)   2.945448   0.647696
np.log(gas)      1.063321   0.991558
np.log(rtwex)    1.138404   0.514910
befile6         -0.017314   0.321390
affile6         -0.033108   0.323806
afdec6          -0.577328   0.344075


12.3. Serial Correlation-Robust Inference with OLS


Unbiasedness and consistency of OLS are not affected by heteroscedasticity or serial correlation,
but the standard errors are. Similar to the heteroscedasticity-robust standard errors discussed in
Section 8.1, we can use a robust formula for the variance-covariance matrix. The resulting standard
errors are often referred to as Newey-West standard errors. The module statsmodels provides the
formula in the method fit as the option cov_type='HAC'. The argument cov_kwds specifies further
details like the order of considered serial correlation (labeled g in Wooldridge (2019)). After that,
reported standard errors, t statistics and their p values are based on the robust variance-covariance
matrix.
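
To show what such a formula does, the following minimal sketch computes a HAC variance-covariance matrix with Bartlett weights by hand on simulated data and compares the implied standard errors with the usual ones. The function newey_west_cov, the helper ar1 and the simulated series are illustrative assumptions; no small-sample correction is applied, and the sketch is not meant to reproduce every option of the library routine.

```python
import numpy as np


def newey_west_cov(X, resid, maxlags):
    """HAC covariance of the OLS coefficients with Bartlett weights
    w_l = 1 - l / (maxlags + 1)."""
    X = np.asarray(X, dtype=float)
    resid = np.asarray(resid, dtype=float)
    Xe = X * resid[:, None]              # row t is x_t * e_t
    S = Xe.T @ Xe                        # lag 0: heteroscedasticity-robust part
    for lag in range(1, maxlags + 1):
        w = 1.0 - lag / (maxlags + 1.0)
        G = Xe[lag:].T @ Xe[:-lag]       # sum over t of e_t e_{t-lag} x_t x_{t-lag}'
        S = S + w * (G + G.T)
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ S @ XtX_inv


def ar1(rho, n, rng):
    """Simulate an AR(1) series with standard normal innovations."""
    e = rng.standard_normal(n)
    out = np.zeros(n)
    for t in range(1, n):
        out[t] = rho * out[t - 1] + e[t]
    return out


rng = np.random.default_rng(7)
n = 300
x = ar1(0.8, n, rng)
u = ar1(0.8, n, rng)
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones(n), x])

b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b

# usual and HAC standard errors:
sigma2 = resid @ resid / (n - X.shape[1])
se_usual = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
cov_hac = newey_west_cov(X, resid, maxlags=4)
se_hac = np.sqrt(np.diag(cov_hac))

print(f'se_usual: {se_usual}\n')
print(f'se_hac: {se_hac}\n')
```

Setting maxlags to zero drops all cross-period terms, so the formula collapses to the heteroscedasticity-robust (White) covariance matrix.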


Wooldridge, Example 12.1: The Puerto Rican Minimum Wage 


Script 12.6 (Example-12-1.py) estimates a model for the employment rate depending on the minimum
wage as well as the GNP in Puerto Rico and the US. After the model has been fitted by OLS, we
provide regression coefficients and standard errors using the usual variance-covariance formula. With
the options cov_type='HAC' and cov_kwds={'maxlags': 2}, we get the results for the HAC
variance-covariance formula. Both results imply a significantly negative relation between the minimum
wage and employment.


Script 12.6: Example-12-1.py
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

prminwge = woo.dataWoo('prminwge')
T = len(prminwge)
prminwge['time'] = prminwge['year'] - 1949
prminwge.index = pd.date_range(start='1950', periods=T, freq='Y').year

# OLS regression:
reg = smf.ols(formula='np.log(prepop) ~ np.log(mincov) + np.log(prgnp) +'
                      'np.log(usgnp) + time', data=prminwge)

# results with regular SE:
results_regu = reg.fit()

# print regression table:
table_regu = pd.DataFrame({'b': round(results_regu.params, 4),
                           'se': round(results_regu.bse, 4),
                           't': round(results_regu.tvalues, 4),
                           'pval': round(results_regu.pvalues, 4)})
print(f'table_regu: \n{table_regu}\n')

# results with HAC SE:
results_hac = reg.fit(cov_type='HAC', cov_kwds={'maxlags': 2})

# print regression table:
table_hac = pd.DataFrame({'b': round(results_hac.params, 4),
                          'se': round(results_hac.bse, 4),
                          't': round(results_hac.tvalues, 4),
                          'pval': round(results_hac.pvalues, 4)})
print(f'table_hac: \n{table_hac}\n')


Output of Script 12.6: Example-12-1.py


table_regu:
                     b      se       t    pval
Intercept      -6.6634  1.2578 -5.2976  0.0000
np.log(mincov) -0.2123  0.0402 -5.2864  0.0000
np.log(prgnp)   0.2852  0.0805  3.5437  0.0012
np.log(usgnp)   0.4860  0.2220  2.1896  0.0357
time           -0.0267  0.0046 -5.7629  0.0000

table_hac:
                     b      se       t    pval
Intercept      -6.6634  1.4318 -4.6539  0.0000
np.log(mincov) -0.2123  0.0426 -4.9821  0.0000
np.log(prgnp)   0.2852  0.0928  3.0720  0.0021
np.log(usgnp)   0.4860  0.2601  1.8687  0.0617
time           -0.0267  0.0054 -4.9710  0.0000


12.4. Autoregressive Conditional Heteroscedasticity 


In time series, especially in financial data, a specific form of heteroscedasticity is often present. 
Autoregressive conditional heteroscedasticity (ARCH) and related models try to capture these effects. 
Consider a basic linear time series equation 


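$$y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \cdots + \beta_k x_{tk} + u_t \qquad (12.2)$$


The error term $u$ follows an ARCH process if


$$E(u_t^2 \mid u_{t-1}, u_{t-2}, \ldots) = \alpha_0 + \alpha_1 u_{t-1}^2. \qquad (12.3)$$


As the equation suggests, we can estimate $\alpha_0$ and $\alpha_1$ by an OLS regression of the squared
residuals $\hat{u}_t^2$ on $\hat{u}_{t-1}^2$.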
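
As a quick check of this logic, the sketch below simulates an ARCH(1) error process with known parameters and recovers them by regressing the squared errors on their first lag. The parameter values, the sample size and all variable names are assumptions chosen for illustration. With real data, the squared OLS residuals take the place of the simulated errors.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000
alpha0, alpha1 = 1.0, 0.2        # assumed true ARCH(1) parameters

# simulate u_t with conditional variance alpha0 + alpha1 * u_{t-1}^2:
z = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    sigma2_t = alpha0 + alpha1 * u[t - 1] ** 2
    u[t] = np.sqrt(sigma2_t) * z[t]

# OLS regression of u_t^2 on a constant and u_{t-1}^2:
u_sq = u ** 2
X = np.column_stack([np.ones(n - 1), u_sq[:-1]])
a_hat = np.linalg.lstsq(X, u_sq[1:], rcond=None)[0]

print(f'a_hat: {a_hat}\n')
```

The two elements of a_hat are the estimates of the intercept and the slope of Equation 12.3 and should be close to the assumed values in a sample of this size.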


Wooldridge, Example 12.9: ARCH in Stock Returns 


Script 12.7 (Example-12-9.py) estimates a simple AR(1) model for weekly NYSE stock returns, already
studied in Example 11.4. After the squared residuals are obtained, they are regressed on their lagged
values. The coefficients from this regression are estimates for $\alpha_0$ and $\alpha_1$.


Script 12.7: Example-12-9.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

nyse = woo.dataWoo('nyse')
nyse['ret'] = nyse['return']
nyse['ret_lag1'] = nyse['ret'].shift(1)

# linear regression of model:
reg = smf.ols(formula='ret ~ ret_lag1', data=nyse)
results = reg.fit()

# squared residuals:
nyse['resid_sq'] = results.resid ** 2
nyse['resid_sq_lag1'] = nyse['resid_sq'].shift(1)

# model for squared residuals:
ARCHreg = smf.ols(formula='resid_sq ~ resid_sq_lag1', data=nyse)
results_ARCH = ARCHreg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results_ARCH.params, 4),
                      'se': round(results_ARCH.bse, 4),
                      't': round(results_ARCH.tvalues, 4),
                      'pval': round(results_ARCH.pvalues, 4)})
print(f'table: \n{table}\n')


Output of Script 12.7: Example-12-9.py 
table: 
                    b      se       t  pval
Intercept      2.9474  0.4402  6.6951   0.0
resid_sq_lag1  0.3371  0.0359  9.3767   0.0


As a second example, let us reconsider the daily stock returns from Script 11.2
(Example-EffMkts.py). We again download the daily Apple stock prices from Yahoo Finance
and calculate their returns. Figure 11.1 on page 209 plots them. They show a very typical
pattern for an ARCH-type of model: there are periods with high volatility (such as fall 2008) and
other periods with low volatility (fall 2010). In Script 12.8 (Example-ARCH.py), we estimate an AR(1)
process for the squared residuals. The t statistic is larger than 8, so there is very strong evidence for
autoregressive conditional heteroscedasticity.


Script 12.8: Example-ARCH.py
import numpy as np
import pandas as pd
import pandas_datareader as pdr
import statsmodels.formula.api as smf

# download data for 'AAPL' (= Apple) and define start and end:
tickers = ['AAPL']
start_date = '2007-12-31'
end_date = '2016-12-31'

# use pandas_datareader for the import:
AAPL_data = pdr.data.DataReader(tickers, 'yahoo', start_date, end_date)

# drop ticker symbol from column name:
AAPL_data.columns = AAPL_data.columns.droplevel(level=1)

# calculate return as the difference of logged prices:
AAPL_data['ret'] = np.log(AAPL_data['Adj Close']).diff()
AAPL_data['ret_lag1'] = AAPL_data['ret'].shift(1)

# AR(1) model for returns:
reg = smf.ols(formula='ret ~ ret_lag1', data=AAPL_data)
results = reg.fit()

# squared residuals:
AAPL_data['resid_sq'] = results.resid ** 2
AAPL_data['resid_sq_lag1'] = AAPL_data['resid_sq'].shift(1)

# model for squared residuals:
ARCHreg = smf.ols(formula='resid_sq ~ resid_sq_lag1', data=AAPL_data)
results_ARCH = ARCHreg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results_ARCH.params, 4),
                      'se': round(results_ARCH.bse, 4),
                      't': round(results_ARCH.tvalues, 4),
                      'pval': round(results_ARCH.pvalues, 4)})
print(f'table: \n{table}\n')


Output of Script 12.8: Example-ARCH.py
table:
                    b      se        t  pval
Intercept      0.0003  0.0000  12.1550   0.0
resid_sq_lag1  0.1722  0.0207   8.3182   0.0


Advanced Topics


13. Pooling Cross-Sections Across Time: 
Simple Panel Data Methods 


Pooled cross sections consist of random samples from the same population at different points in 
time. Section 13.1 introduces this type of data set and how to use it for estimating changes over 
time. Section 13.2 covers difference-in-differences estimators, an important application of pooled 
cross-sections for identifying causal effects. 

Panel data resemble pooled cross sectional data in that we have observations at different points in 
time. The key difference is that we observe the same cross-sectional units, for example individuals 
or firms. Panel data methods require the data to be organized in a systematic way, as discussed in 
Section 13.3. Section 13.4 introduces the first panel data method, first differenced estimation. 


13.1. Pooled Cross-Sections 


If we have random samples at different points in time, this not only increases the overall sample
size and thereby the statistical precision of our analyses. It also allows us to study changes over time
and to shed additional light on relationships between variables.


Wooldridge, Example 13.2: Changes to the Return to Education and the 
Gender Wage Gap 


The data set cps78_85 includes two pooled cross-sections for the years 1978 and 1985. The dummy
variable y85 is equal to one for observations in 1985 and to zero for 1978. We estimate a model for the
log wage lwage of the form


$$\begin{aligned}
\text{lwage} = {} & \beta_0 + \delta_0\,\text{y85} + \beta_1\,\text{educ} + \delta_1 (\text{y85} \cdot \text{educ}) + \beta_2\,\text{exper} + \beta_3 \frac{\text{exper}^2}{100} \\
& + \beta_4\,\text{union} + \beta_5\,\text{female} + \delta_5 (\text{y85} \cdot \text{female}) + u.
\end{aligned}$$


Note that we divide $\text{exper}^2$ by 100 and thereby multiply $\beta_3$ by 100 compared to the results reported in
Wooldridge (2019). The parameter $\beta_1$ measures the return to education in 1978 and $\delta_1$ is the difference
of the return to education in 1985 relative to 1978. Likewise, $\beta_5$ is the gender wage gap in 1978 and $\delta_5$ is
the change of the wage gap.

Script 13.1 (Example-13-2.py) estimates the model. The return to education is estimated to have
increased by $\hat{\delta}_1 = 0.0185$ and the gender wage gap decreased in absolute value from $\hat{\beta}_5 = -0.3167$
to $\hat{\beta}_5 + \hat{\delta}_5 = -0.2316$, even though this change is only marginally significant. The interpretation and
implementation of interactions were covered in more detail in Section 6.1.6.


Script 13.1: Example-13-2.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

cps78_85 = woo.dataWoo('cps78_85')

# OLS results including interaction terms:
reg = smf.ols(formula='lwage ~ y85*(educ+female) + exper +'
                      'I((exper**2)/100) + union',
              data=cps78_85)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Output of Script 13.1: Example-13-2.py 


table:
                             b      se        t    pval
Intercept               0.4589  0.0934   4.9111  0.0000
y85                     0.1178  0.1238   0.9517  0.3415
educ                    0.0747  0.0067  11.1917  0.0000
female                 -0.3167  0.0366  -8.6482  0.0000
y85:educ                0.0185  0.0094   1.9735  0.0487
y85:female              0.0851  0.0513   1.6576  0.0977
exper                   0.0296  0.0036   8.2932  0.0000
I((exper ** 2) / 100)  -0.0399  0.0078  -5.1513  0.0000
union                   0.2021  0.0303   6.6722  0.0000


13.2. Difference-in-Differences 


Wooldridge (2019, Section 13.2) discusses an important type of application for pooled cross-sections. 
Difference-in-differences (DiD) estimators estimate the effect of a policy intervention (in the broadest 
sense) by comparing the change over time of an outcome of interest between an affected and an 
unaffected group of observations. 

In a regression framework, we regress the outcome of interest on a dummy variable for the affected
("treatment") group, a dummy indicating observations after the treatment and an interaction term 
between both. The coefficient of this interaction term can then be a good estimator for the effect of 
interest, controlling for initial differences between the groups and contemporaneous changes over 
time. 
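
The equivalence between the regression with an interaction term and the difference of group differences can be checked on simulated data. The following is a minimal sketch; the variable names (treat, after, cell_mean) and the data generating process are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
treat = rng.integers(0, 2, size=n)     # 1 = affected ("treatment") group
after = rng.integers(0, 2, size=n)     # 1 = observed after the intervention
y = (10.0 - 2.0 * treat + 1.5 * after - 1.0 * treat * after
     + rng.standard_normal(n))


def cell_mean(group, period):
    """Average outcome for one combination of group and period."""
    return y[(treat == group) & (after == period)].mean()


# difference (treated minus untreated) after, minus the same difference before:
did_means = ((cell_mean(1, 1) - cell_mean(0, 1))
             - (cell_mean(1, 0) - cell_mean(0, 0)))

# regression on both dummies and their interaction:
X = np.column_stack([np.ones(n), treat, after, treat * after])
b = np.linalg.lstsq(X, y, rcond=None)[0]
did_ols = b[3]

print(f'did_means: {did_means}\n')
print(f'did_ols: {did_ols}\n')
```

Both numbers coincide because the regression with two dummies and their interaction is saturated: it fits the four cell means exactly. The regression has the advantage that it also delivers a standard error for the effect.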


Wooldridge, Example 13.3: Effect of a Garbage Incinerator's Location on 
Housing Prices 


We are interested in whether and how much the construction of a new garbage incinerator affected
the value of nearby houses. Script 13.2 (Example-13-3-1.py) uses the data set KIELMC. We
first estimate separate models for 1978 (before there were any rumors about the new incinerator)
and 1981 (when the construction began). In 1981, the houses close to the construction site were
cheaper by an average of $30,688.27. But this was not only due to the new incinerator since even
in 1978, nearby houses were cheaper by an average of $18,824.37. The difference of these differences,
$30,688.27 − $18,824.37 = $11,863.90, is the DiD estimator and is arguably a better indicator of the
actual effect.

The DiD estimator can be obtained more conveniently using a joint regression model with the interaction
term as described above. The estimate appears directly as the coefficient of the interaction term, which
is −11,863.90 because nearby houses lost value. Conveniently, standard regression tables include t tests
of the hypothesis that the actual effect is equal to zero. For a one-sided test, the p value is
0.113/2 = 0.056, so there is some statistical evidence of a negative impact.

The DiD estimator can be improved. A logarithmic specification is more plausible since it implies a
constant percentage effect on the house values. We can also add additional regressors to control for
incidental changes in the composition of the houses traded. Script 13.3 (Example-13-3-2.py) implements
both improvements. The model including features of the houses implies an estimated decrease
in the house values of about 13.2%. This effect is also significantly different from zero.


Script 13.2: Example-13-3-1.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

kielmc = woo.dataWoo('kielmc')

# separate regressions for 1978 and 1981:
y78 = (kielmc['year'] == 1978)
reg78 = smf.ols(formula='rprice ~ nearinc', data=kielmc, subset=y78)
results78 = reg78.fit()

y81 = (kielmc['year'] == 1981)
reg81 = smf.ols(formula='rprice ~ nearinc', data=kielmc, subset=y81)
results81 = reg81.fit()

# joint regression including an interaction term:
reg_joint = smf.ols(formula='rprice ~ nearinc * C(year)', data=kielmc)
results_joint = reg_joint.fit()

# print regression tables:
table_78 = pd.DataFrame({'b': round(results78.params, 4),
                         'se': round(results78.bse, 4),
                         't': round(results78.tvalues, 4),
                         'pval': round(results78.pvalues, 4)})
print(f'table_78: \n{table_78}\n')

table_81 = pd.DataFrame({'b': round(results81.params, 4),
                         'se': round(results81.bse, 4),
                         't': round(results81.tvalues, 4),
                         'pval': round(results81.pvalues, 4)})
print(f'table_81: \n{table_81}\n')

table_joint = pd.DataFrame({'b': round(results_joint.params, 4),
                            'se': round(results_joint.bse, 4),
                            't': round(results_joint.tvalues, 4),
                            'pval': round(results_joint.pvalues, 4)})
print(f'table_joint: \n{table_joint}\n')


Output of Script 13.2: Example-13-3-1.py


table_78:
                    b        se        t    pval
Intercept  82517.2276  2653.790  31.0941  0.0000
nearinc   -18824.3705  4744.594  -3.9675  0.0001

table_81:
                     b         se        t  pval
Intercept  101307.5136  3093.0267  32.7535   0.0
nearinc    -30688.2738  5827.7088  -5.2659   0.0

table_joint:
                                  b         se        t    pval
Intercept                82517.2276  2726.9101  30.2603  0.0000
C(year)[T.1981]          18790.2860  4050.0650   4.6395  0.0000
nearinc                 -18824.3705  4875.3221  -3.8612  0.0001
nearinc:C(year)[T.1981] -11863.9033  7456.6462  -1.5911  0.1126


Script 13.3: Example-13-3-2.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

kielmc = woo.dataWoo('kielmc')

# difference in difference (DiD):
reg_did = smf.ols(formula='np.log(rprice) ~ nearinc*C(year)',
                  data=kielmc)
results_did = reg_did.fit()

# print regression table:
table_did = pd.DataFrame({'b': round(results_did.params, 4),
                          'se': round(results_did.bse, 4),
                          't': round(results_did.tvalues, 4),
                          'pval': round(results_did.pvalues, 4)})
print(f'table_did: \n{table_did}\n')

# DiD with control variables:
reg_didC = smf.ols(formula='np.log(rprice) ~ nearinc*C(year) + age +'
                           'I(age**2) + np.log(intst) + np.log(land) +'
                           'np.log(area) + rooms + baths',
                   data=kielmc)
results_didC = reg_didC.fit()

# print regression table:
table_didC = pd.DataFrame({'b': round(results_didC.params, 4),
                           'se': round(results_didC.bse, 4),
                           't': round(results_didC.tvalues, 4),
                           'pval': round(results_didC.pvalues, 4)})
print(f'table_didC: \n{table_didC}\n')


Output of Script 13.3: Example-13-3-2.py


table_did:
                               b      se         t    pval
Intercept                11.2854  0.0305  369.8386  0.0000
C(year)[T.1981]           0.1931  0.0453    4.2606  0.0000
nearinc                  -0.3399  0.0546   -6.2308  0.0000
nearinc:C(year)[T.1981]  -0.0626  0.0834   -0.7508  0.4533

table_didC:
                               b      se        t    pval
Intercept                 7.6517  0.4159  18.3986  0.0000
C(year)[T.1981]           0.1621  0.0285   5.6868  0.0000
nearinc                   0.0322  0.0475   0.6789  0.4977
nearinc:C(year)[T.1981]  -0.1315  0.0520  -2.5305  0.0119
age                      -0.0084  0.0014  -5.9236  0.0000
I(age ** 2)               0.0000  0.0000   4.3415  0.0000
np.log(intst)            -0.0614  0.0315  -1.9500  0.0521
np.log(land)              0.0998  0.0245   4.0766  0.0001
np.log(area)              0.3508  0.0515   6.8129  0.0000
rooms                     0.0473  0.0173   2.7317  0.0067
baths                     0.0943  0.0277   3.4003  0.0008


13.3. Organizing Panel Data 


A panel data set includes several observations at different points in time t for the same (or at least 
an overlapping) set of cross-sectional units i. A simple “pooled” regression model could look like 


$$y_{it} = \beta_0 + \beta_1 x_{it1} + \beta_2 x_{it2} + \cdots + \beta_k x_{itk} + v_{it}; \qquad t = 1, \ldots, T; \quad i = 1, \ldots, n \qquad (13.1)$$


where the double subscript now indicates values for individual (or other cross-sectional unit) i at time 
t. We could estimate this model by OLS, essentially ignoring the panel structure. But at least the 
assumption that the error terms are unrelated is very hard to justify since they contain unobserved 
individual traits that are likely to be constant or at least correlated over time. Therefore, we need 
specific methods for panel data. 

For the calculations used by panel data methods, we have to make sure that the data set is sys- 
tematically organized and the estimation routines understand its structure. Usually, a panel data set 
comes in a "long" form where each row of data corresponds to one combination of i and t. We have 
to define which observations belong together by introducing an index variable for the cross-sectional 
units i and preferably also the time index t. 

The module linearmodels is a comprehensive collection of commands dealing with panel data.
It is not part of the Anaconda distribution and you have to install it as explained in Section 1.1.3.
When working with panel data in linearmodels, our first line of code always is:


import linearmodels as plm


The routines require a pandas data frame with a two-dimensional index that describes the individual
and time dimensions. Suppose we have our data in a standard data frame named mydf. It
includes a variable ivar indicating the cross-sectional units and a variable tvar indicating the time.
To work with linearmodels we create a data frame with the command


mydf = mydf.set_index(['ivar', 'tvar'])


Let's apply this to the data set CRIME2 discussed by Wooldridge (2019, Section 13.3). It is a
balanced panel of 46 cities, properly sorted. Script 13.4 (Example-FD.py) imports the data set and
sets the indices correctly.

Once we use routines from linearmodels, it will report the number of cross-sectional units n,
the number of time units T, and the total number of observations N. For an example, look at the
first part of the output in Script 13.5 (Example-13-9.py).
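
The indexing step described in this section can be tried out on a tiny made-up panel. The following sketch uses only pandas; the data values are invented for illustration and the names mydf, ivar and tvar follow the generic example in the text.

```python
import pandas as pd

# a small panel in "long" form: one row per combination of unit and time
mydf = pd.DataFrame({'ivar': [1, 1, 2, 2, 3, 3],
                     'tvar': [2019, 2020, 2019, 2020, 2019, 2020],
                     'y': [4.0, 5.0, 7.0, 6.5, 1.0, 2.0]})

# two-dimensional index: cross-sectional unit first, then time
mydf = mydf.set_index(['ivar', 'tvar'])

# basic checks of the panel structure:
n_units = mydf.index.get_level_values('ivar').nunique()
n_periods = mydf.index.get_level_values('tvar').nunique()
is_balanced = (len(mydf) == n_units * n_periods)

# select the observation of unit 2 in 2020:
y_2_2020 = mydf.loc[(2, 2020), 'y']

print(f'mydf: \n{mydf}\n')
print(f'n_units: {n_units}, n_periods: {n_periods}, balanced: {is_balanced}\n')
```

A panel is balanced if every unit is observed in every period, so the number of rows equals the number of units times the number of periods.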


13.4. First Differenced Estimator 


Wooldridge (2019, Sections 13.3 — 13.5) discusses basic unobserved effects models and their estima- 
tion by first-differencing (FD). Consider the model 


$$y_{it} = \beta_0 + \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i + u_{it}; \qquad t = 1, \ldots, T; \quad i = 1, \ldots, n \qquad (13.2)$$


which differs from Equation 13.1 in that it explicitly involves an unobserved effect $a_i$ that is constant
over time (since it has no t subscript). If it is correlated with one or more of the regressors
$x_{it1}, \ldots, x_{itk}$, we cannot simply ignore $a_i$, leave it in the composite error term $v_{it} = a_i + u_{it}$ and
estimate the equation by OLS. The error term $v_{it}$ would be related to the regressors, violating
assumption MLR.4 (and MLR.4') and creating biases and inconsistencies. Note that this problem is not
unique to panel data, but possible solutions are.

The first differenced (FD) estimator is based on the first difference of the whole equation: 


$$\Delta y_{it} \equiv y_{it} - y_{it-1} = \beta_1 \Delta x_{it1} + \cdots + \beta_k \Delta x_{itk} + \Delta u_{it}; \qquad t = 2, \ldots, T; \quad i = 1, \ldots, n \qquad (13.3)$$


Note that we cannot evaluate this equation for the first observation t = 1 for any i since the lagged
values are unknown for them. The trick is that $a_i$ drops out of the equation by differencing since it
does not change over time. No matter how badly it is correlated with the regressors, it cannot hurt
the estimation anymore. This estimating equation is then analyzed by OLS. We simply regress the
differenced dependent variable $\Delta y_{it}$ on the differenced independent variables $\Delta x_{it1}, \ldots, \Delta x_{itk}$.
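
A small simulation illustrates why differencing helps. In the following sketch, the unobserved effect is deliberately correlated with the regressor, so pooled OLS is biased while the first differenced estimator is not. The data generating process and all names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n, T = 500, 3
ids = np.repeat(np.arange(n), T)
time = np.tile(np.arange(T), n)
a = np.repeat(rng.standard_normal(n), T)       # unobserved effect a_i
x = a + rng.standard_normal(n * T)             # regressor correlated with a_i
y = 1.0 + 2.0 * x + 3.0 * a + rng.standard_normal(n * T)
df = pd.DataFrame({'id': ids, 't': time, 'x': x, 'y': y})

# pooled OLS ignores a_i and is biased here:
X_pooled = np.column_stack([np.ones(n * T), df['x'].to_numpy()])
b_pooled = np.linalg.lstsq(X_pooled, df['y'].to_numpy(), rcond=None)[0][1]

# first differences per unit; the first period of each unit becomes NaN:
df = df.sort_values(['id', 't'])
d = df.groupby('id')[['x', 'y']].diff().dropna()

# OLS of the differenced y on the differenced x (no intercept in this model):
dx = d['x'].to_numpy()
dy = d['y'].to_numpy()
b_fd = np.dot(dx, dy) / np.dot(dx, dx)

print(f'b_pooled: {b_pooled}\n')
print(f'b_fd: {b_fd}\n')
```

The true slope in this simulation is 2. The pooled estimate is pushed far away from it by the correlation between the regressor and the unobserved effect, whereas the first differenced estimate should be close to 2.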

Script 13.4 (Example-FD.py) opens the data set CRIME2 already described above. We describe
the cumbersome data preparation required for the manual estimation. Before we can use the method
diff to calculate first differences of the dependent variable crime rate (crmrte) and the independent
variable unemployment rate (unem), we have to make sure with groupby('id') that these
calculations are performed per individual.

A list of the first five observations reveals that the differences are unavailable (NaN) for the first
year of each city. The other differences are also calculated as expected. For example the change of
the crime rate for city 1 is 70.11729 − 74.65756 = −4.540268 and the change of the unemployment
rate for city 2 is 5.4 − 8.1 = −2.7. The FD estimator can now be calculated by simply applying
OLS to these differenced values. The observations for the first year with missing information are
automatically dropped from the estimation sample. The results show a significantly positive relation
between unemployment and crime.


Script 13.4: Example-FD.py


import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import linearmodels as plm

crime2 = woo.dataWoo('crime2')

# create time variable dummy by converting a Boolean variable to an integer:
crime2['t'] = (crime2['year'] == 87).astype(int)  # False=0, True=1

# create an index in this balanced data set by combining two arrays:
id_tmp = np.linspace(1, 46, num=46)
crime2['id'] = np.sort(np.concatenate([id_tmp, id_tmp]))

# manually calculate first differences per entity for crmrte and unem:
crime2['crmrte_diff1'] = \
    crime2.sort_values(['id', 'year']).groupby('id')['crmrte'].diff()
crime2['unem_diff1'] = \
    crime2.sort_values(['id', 'year']).groupby('id')['unem'].diff()
var_selection = ['id', 't', 'crimes', 'unem', 'crmrte_diff1', 'unem_diff1']
print(f'crime2[var_selection].head(): \n{crime2[var_selection].head()}\n')

# estimate FD model with statsmodels on differenced data:
reg_sm = smf.ols(formula='crmrte_diff1 ~ unem_diff1', data=crime2)
results_sm = reg_sm.fit()

# print results:
table_sm = pd.DataFrame({'b': round(results_sm.params, 4),
                         'se': round(results_sm.bse, 4),
                         't': round(results_sm.tvalues, 4),
                         'pval': round(results_sm.pvalues, 4)})
print(f'table_sm: \n{table_sm}\n')

# estimate FD model with linearmodels:
crime2 = crime2.set_index(['id', 'year'])
reg_plm = plm.FirstDifferenceOLS.from_formula(formula='crmrte ~ t + unem',
                                              data=crime2)
results_plm = reg_plm.fit()

# print results:
table_plm = pd.DataFrame({'b': round(results_plm.params, 4),
                          'se': round(results_plm.std_errors, 4),
                          't': round(results_plm.tstats, 4),
                          'pval': round(results_plm.pvalues, 4)})
print(f'table_plm: \n{table_plm}\n')


Output of Script 13.4: Example-FD.py
crime2[var_selection].head():
    id  t   crimes  unem  crmrte_diff1  unem_diff1
0  1.0  0  17136.0   8.2           NaN         NaN
1  1.0  1  17306.0   3.7     -4.540268        -4.5
2  2.0  0  75654.0   8.1           NaN         NaN
3  2.0  1  83960.0   5.4     -2.962654        -2.7
4  3.0  0  31352.0   9.0           NaN         NaN

table_sm:
                  b      se       t    pval
Intercept   15.4022  4.7021  3.2756  0.0021
unem_diff1   2.2180  0.8779  2.5266  0.0152

table_plm:
            b      se       t    pval
t     15.4022  4.7021  3.2756  0.0021
unem   2.2180  0.8779  2.5266  0.0152


Generating the differenced values and using ols on them is actually unnecessary. The command
FirstDifferenceOLS shows that many lines of code can be saved by using the canned routine
in linearmodels. All the necessary calculations are done internally. As the output of Script 13.4
(Example-FD.py) shows, the parameter estimates are exactly the same as those from our pedestrian
calculations.¹


Wooldridge, Example 13.9: County Crime Rates in North Carolina 

Script 13.5 (Example-13-9.py) analyzes the data CRIME4. We estimate the model in first differences
using linearmodels.

Note that in this specification, all variables are automatically differenced, so they have the intuitive 
interpretation in the level equation. In the results reported by Wooldridge (2019), the year dummies are 
not differenced which only makes a difference for the interpretation of the year coefficients. We will 
repeat this example with “robust” standard errors in Section 14.4. 


Script 13.5: Example-13-9.py
import wooldridge as woo
import numpy as np
import linearmodels as plm

crime4 = woo.dataWoo('crime4')
crime4 = crime4.set_index(['county', 'year'], drop=False)

# estimate FD model:
reg = plm.FirstDifferenceOLS.from_formula(
    formula='np.log(crmrte) ~ year + d83 + d84 + d85 + d86 + d87 +'
            'lprbarr + lprbconv + lprbpris + lavgsen + lpolpc',
    data=crime4)
results = reg.fit()
print(f'results: \n{results}\n')


¹ Note that in linearmodels standard errors are accessible by the attribute std_errors instead of bse in statsmodels.


Output of Script 13.5: Example-13-9.py
results:
                  FirstDifferenceOLS Estimation Summary
Dep. Variable:         np.log(crmrte)   R-squared:                 0.4326
Estimator:         FirstDifferenceOLS   R-squared (Between):       0.6003
No. Observations:                 540   R-squared (Within):        0.4281
Date:                Wed, May 13 2020   R-squared (Overall):       0.6000
Time:                        13:04:51   Log-likelihood             248.48
Cov. Estimator:            Unadjusted
                                        F-statistic:               36.661
Entities:                          90   P-value                    0.0000
Avg Obs:                       7.0000   Distribution:           F(11,529)
Min Obs:                       7.0000
Max Obs:                       7.0000   F-statistic (robust):      36.661
                                        P-value                    0.0000
Time periods:                       7   Distribution:           F(11,529)
Avg Obs:                       90.000
Min Obs:                       90.000
Max Obs:                       90.000

                          Parameter Estimates
          Parameter  Std. Err.   T-stat  P-value  Lower CI  Upper CI
year         0.0077     0.0171   0.4522   0.6513   -0.0258    0.0412
d83         -0.0999     0.0239  -4.1793   0.0000   -0.1468   -0.0529
d84         -0.1478     0.0413  -3.5806   0.0004   -0.2289   -0.0667
d85         -0.1524     0.0584  -2.6098   0.0093   -0.2671   -0.0377
d86         -0.1249     0.0760  -1.6433   0.1009   -0.2742    0.0244
d87         -0.0841     0.0940  -0.8944   0.3715   -0.2687    0.1006
lprbarr     -0.3275     0.0300  -10.924   0.0000   -0.3864   -0.2686
lprbconv    -0.2381     0.0182  -13.058   0.0000   -0.2739   -0.2023
lprbpris    -0.1650     0.0260  -6.3555   0.0000   -0.2161   -0.1140
lavgsen     -0.0218     0.0221  -0.9850   0.3251   -0.0652    0.0216
lpolpc       0.3984     0.0269   14.821   0.0000    0.3456    0.4512


In this chapter, we look into additional panel data models and methods. We start with the widely
used fixed effects (FE) estimator in Section 14.1, followed by random effects (RE) in Section 14.2. The 
dummy variable regression and correlated random effects approaches presented in Section 14.3 can 
be used as alternatives and generalizations of FE. Finally, we cover robust formulas for the variance- 
covariance matrix and the implied “clustered” standard errors in Section 14.4. We will come back to 
panel data in combination with instrumental variables in Section 15.6. 


14.1. Fixed Effects Estimation 


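We start from the same basic unobserved effects model as Equation 13.2. Instead of first differencing,
we get rid of the unobserved individual effect $a_i$ using the within transformation:


$$\begin{aligned}
y_{it} &= \beta_0 + \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i + u_{it}; \qquad t = 1, \ldots, T; \quad i = 1, \ldots, n \\
\bar{y}_i &= \beta_0 + \beta_1 \bar{x}_{i1} + \cdots + \beta_k \bar{x}_{ik} + a_i + \bar{u}_i \\
\ddot{y}_{it} = y_{it} - \bar{y}_i &= \beta_1 \ddot{x}_{it1} + \cdots + \beta_k \ddot{x}_{itk} + \ddot{u}_{it}, \qquad (14.1)
\end{aligned}$$


where $\bar{y}_i$ is the average of $y_{it}$ over time for cross-sectional unit i and for the other variables
accordingly. The within transformation subtracts these individual averages from the respective
observations $y_{it}$.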

The fixed effects (FE) estimator simply estimates the demeaned Equation 14.1 using pooled OLS.
Instead of applying the within transformation to all variables and running ols, we can simply
use PanelOLS in the module linearmodels. Demeaning is requested by adding the word
EntityEffects to the formula. This has the additional advantage that the degrees of freedom
are adjusted to the demeaning and the variance-covariance matrix and standard errors are adjusted
accordingly.¹ This is shown in Script 14.1 (Example-14-2.py). We will come back to different ways
to get the same estimates in Section 14.3.
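
The within transformation itself takes only a few lines. The following sketch applies it to simulated data with numpy and compares the result with a regression that includes one dummy variable per cross-sectional unit. The function demean_by_group and the simulated panel are assumptions made for illustration. The sketch computes point estimates only and does not reproduce the degrees of freedom adjustment of the standard errors mentioned above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 200, 4
ids = np.repeat(np.arange(n), T)
a = np.repeat(rng.standard_normal(n), T)       # unobserved effect a_i
x = a + rng.standard_normal(n * T)             # regressor correlated with a_i
y = 2.0 * x + 3.0 * a + rng.standard_normal(n * T)


def demean_by_group(v, groups):
    """Subtract the group-specific average from each observation."""
    sums = np.bincount(groups, weights=v)
    counts = np.bincount(groups)
    return v - (sums / counts)[groups]


# within transformation and OLS on the demeaned data:
x_dd = demean_by_group(x, ids)
y_dd = demean_by_group(y, ids)
b_within = np.dot(x_dd, y_dd) / np.dot(x_dd, x_dd)

# regression on x and a full set of unit dummies:
D = (ids[:, None] == np.arange(n)[None, :]).astype(float)
b_lsdv = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0][0]

print(f'b_within: {b_within}\n')
print(f'b_lsdv: {b_lsdv}\n')
```

Both approaches give the same slope estimate. The dummy variable regression estimates one additional parameter per unit, which is the reason why the degrees of freedom have to be adjusted when standard errors are computed from the demeaned data.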


Wooldridge, Example 14.2: Has the Return to Education Changed over Time? 


We estimate the change of the return to education over time using a fixed effects estimator. Script
14.1 (Example-14-2.py) shows the implementation. The data set WAGEPAN is a balanced panel for
n = 545 individuals over T = 8 years. It includes the index variables nr and year for individuals and
years, respectively. Since educ does not change over time, we cannot estimate its overall impact and
have to use drop_absorbed=True in the estimation. However, we can interact it with time dummies to
see how the impact changes over time.


¹ The default behavior of linearmodels is to exclude a constant, because $\beta_0$ drops out of the demeaned
equation. In cases you need one, you can explicitly add it by using "1 +" in the formula.


Script 14.1: Example-14-2.py
import wooldridge as woo
import pandas as pd
import linearmodels as plm

wagepan = woo.dataWoo('wagepan')
wagepan = wagepan.set_index(['nr', 'year'], drop=False)

# FE model estimation:
reg = plm.PanelOLS.from_formula(
    formula='lwage ~ married + union + C(year)*educ + EntityEffects',
    data=wagepan, drop_absorbed=True)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.std_errors, 4),
                      't': round(results.tstats, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Output of Script 14.1: Example-14-2.py

table:
                           b      se        t    pval
C(year)[1980]         1.3625  0.0162  83.9031  0.0000
C(year)[1981]         1.3400  0.1452   9.2307  0.0000
C(year)[1982]         1.3567  0.1451   9.3481  0.0000
C(year)[1983]         1.3729  0.1452   9.4561  0.0000
C(year)[1984]         1.4468  0.1452   9.9617  0.0000
C(year)[1985]         1.4122  0.1451   9.7315  0.0000
C(year)[1986]         1.4281  0.1451   9.8404  0.0000
C(year)[1987]         1.4529  0.1452  10.0061  0.0000
married               0.0548  0.0184   2.9773  0.0029
union                 0.0830  0.0194   4.2671  0.0000
C(year)[T.1981]:educ  0.0116  0.0123   0.9448  0.3448
C(year)[T.1982]:educ  0.0148  0.0123   1.2061  0.2279
C(year)[T.1983]:educ  0.0171  0.0123   1.3959  0.1628
C(year)[T.1984]:educ  0.0166  0.0123   1.3521  0.1764
C(year)[T.1985]:educ  0.0237  0.0123   1.9316  0.0535
C(year)[T.1986]:educ  0.0274  0.0123   2.2334  0.0256
C(year)[T.1987]:educ  0.0304  0.0123   2.4798  0.0132


14.2. Random Effects Models 


We again base our analysis on the basic unobserved effects model in Equation 13.2. The random 
effects (RE) model assumes that the unobserved effects $a_i$ are independent of (or at least uncorrelated 
with) the regressors $x_{itj}$ for all t and j = 1,...,k. Therefore, our main motivation for using FD or FE 
disappears: OLS consistently estimates the model parameters under this additional assumption. 
However, like the situation with heteroscedasticity (see Section 8.3) and autocorrelation (see 
Section 12.2), we can obtain more efficient estimates if we take into account the structure of the variances 
and covariances of the error term. Wooldridge (2019, Section 14.2) shows that the GLS transformation 
that takes care of the special structure implied by the RE model leads to a quasi-demeaned 
specification 

$$
\mathring{y}_{it} = y_{it} - \theta \bar{y}_i = \beta_0 (1 - \theta) + \beta_1 \mathring{x}_{it1} + \dots + \beta_k \mathring{x}_{itk} + \mathring{v}_{it},
\tag{14.2}
$$

where $\mathring{y}_{it}$ is similar to the demeaned $\ddot{y}_{it}$ from Equation 14.1 but subtracts only a fraction $\theta$ of the 
individual averages. The same holds for the regressors $x_{itj}$ and the composite error term $v_{it} = a_i + u_{it}$. 


The parameter $\theta = 1 - \sqrt{\sigma_u^2 / (\sigma_u^2 + T \sigma_a^2)}$ depends on the variances of $u_{it}$ and $a_i$ and the length of the 
time series dimension T. It is unknown and has to be estimated. Given our experience with FD and 
FE estimation, it should not come as a surprise that we can estimate the RE model parameters in 
linearmodels using the command RandomEffects. Different versions of estimating the random 
effects parameter $\theta$ can be implemented and one version is saved as the attribute theta in the results 
object (see the module documentation for more details). 
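
To get a feeling for the quasi-demeaning weight, the following sketch evaluates the textbook formula $\theta = 1 - \sqrt{\sigma_u^2 / (\sigma_u^2 + T \sigma_a^2)}$ for a few variance combinations. The function name re_theta and the variance values are made up for illustration; they are not estimates from WAGEPAN and not the routine linearmodels uses internally.

```python
import math


def re_theta(sigma2_u, sigma2_a, T):
    """Quasi-demeaning weight for given variance components (hypothetical helper)."""
    return 1.0 - math.sqrt(sigma2_u / (sigma2_u + T * sigma2_a))


# illustrative values: the larger the variance of a_i, the closer theta is to 1
for sigma2_a in [0.0, 0.1, 1.0, 100.0]:
    theta = re_theta(sigma2_u=1.0, sigma2_a=sigma2_a, T=8)
    print(f'sigma2_a: {sigma2_a}, theta: {round(theta, 4)}')
```

With $\sigma_a^2 = 0$ the weight is zero and RE collapses to pooled OLS. As $\sigma_a^2$ or T grows, the weight approaches one and RE approaches the FE estimator.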

Unlike with FD and FE estimators, we can include variables in our model that are constant over 
time for each cross-sectional unit. We can use pandas methods to provide a list of these variables as 
well as of those that do not vary within each point in time. 


Wooldridge, Example 14.4: A Wage Equation Using Panel Data 


The data set WAGEPAN was already used in Example 14.2. Script 14.2 (Example-14-4-1.py) loads the 
data set. Then, we check the panel dimensions and get a list of time-constant variables using 
pandas. To do so, we calculate grouped variances and use the fact that they are zero over time or 
across individuals. With these preparations, we get estimates using OLS, RE, and 
FE estimators in Script 14.3 (Example-14-4-2.py). We use PooledOLS, RandomEffects and PanelOLS 
(with the option EntityEffects), respectively. 


Script 14.2: Example-14-4-1.py
import wooldridge as woo

wagepan = woo.dataWoo('wagepan')

# print relevant dimensions for panel:
N = wagepan.shape[0]
T = wagepan['year'].drop_duplicates().shape[0]
n = wagepan['nr'].drop_duplicates().shape[0]
print(f'N: {N}\n')
print(f'T: {T}\n')
print(f'n: {n}\n')

# check non-varying variables

# (I) across time and within individuals by calculating individual
# specific variances for each variable
isv_nr = (wagepan.groupby('nr').var() == 0)  # True, if variance is zero
# choose variables where all grouped variances are zero:
noVar_nr = isv_nr.all(axis=0)  # which cols are completely True
print(f'isv_nr.columns[noVar_nr]: \n{isv_nr.columns[noVar_nr]}\n')

# (II) across individuals within one point in time for each variable:
isv_t = (wagepan.groupby('year').var() == 0)
noVar_t = isv_t.all(axis=0)
print(f'isv_t.columns[noVar_t]: \n{isv_t.columns[noVar_t]}\n')


Output of Script 14.2: Example-14-4-1.py


isv_nr.columns[noVar_nr]:
Index(['black', 'hisp', 'educ'], dtype='object')

isv_t.columns[noVar_t]:
Index(['d81', 'd82', 'd83', 'd84', 'd85', 'd86', 'd87'], dtype='object')


Script 14.3: Example-14-4-2.py
import wooldridge as woo
import pandas as pd
import linearmodels as plm

wagepan = woo.dataWoo('wagepan')

# estimate different models:
wagepan = wagepan.set_index(['nr', 'year'], drop=False)

reg_ols = plm.PooledOLS.from_formula(
    formula='lwage ~ educ + black + hisp + exper + I(exper**2) +'
            'married + union + C(year)', data=wagepan)
results_ols = reg_ols.fit()

reg_re = plm.RandomEffects.from_formula(
    formula='lwage ~ educ + black + hisp + exper + I(exper**2) +'
            'married + union + C(year)', data=wagepan)
results_re = reg_re.fit()

reg_fe = plm.PanelOLS.from_formula(
    formula='lwage ~ I(exper**2) + married + union +'
            'C(year) + EntityEffects', data=wagepan)
results_fe = reg_fe.fit()

# print results:
theta_hat = results_re.theta.iloc[0, 0]
print(f'theta_hat: {theta_hat}\n')

table_ols = pd.DataFrame({'b': round(results_ols.params, 4),
                          'se': round(results_ols.std_errors, 4),
                          't': round(results_ols.tstats, 4),
                          'pval': round(results_ols.pvalues, 4)})
print(f'table_ols: \n{table_ols}\n')

table_re = pd.DataFrame({'b': round(results_re.params, 4),
                         'se': round(results_re.std_errors, 4),
                         't': round(results_re.tstats, 4),
                         'pval': round(results_re.pvalues, 4)})
print(f'table_re: \n{table_re}\n')

table_fe = pd.DataFrame({'b': round(results_fe.params, 4),
                         'se': round(results_fe.std_errors, 4),
                         't': round(results_fe.tstats, 4),
                         'pval': round(results_fe.pvalues, 4)})
print(f'table_fe: \n{table_fe}\n')


Output of Script 14.3: Example-14-4-2.py

theta_hat: 0.6450593029243452
table_ols:
                     b      se        t    pval
C(year)[1980]   0.0921  0.0783   1.1761  0.2396
C(year)[1981]   0.1504  0.0838   1.7935  0.0730
C(year)[1982]   0.1548  0.0893   1.7335  0.0831
C(year)[1983]   0.1541  0.0944   1.6323  0.1027
C(year)[1984]   0.1825  0.0990   1.8437  0.0653
C(year)[1985]   0.2013  0.1031   1.9523  0.0510
C(year)[1986]   0.2340  0.1068   2.1920  0.0284
C(year)[1987]   0.2659  0.1100   2.4166  0.0157
educ            0.0913  0.0052  17.4419  0.0000
black          -0.1392  0.0236  -5.9049  0.0000
hisp            0.0160  0.0208   0.7703  0.4412
exper           0.0672  0.0137   4.9095  0.0000
I(exper ** 2)  -0.0024  0.0008  -2.9413  0.0033
married         0.1083  0.0157   6.8997  0.0000
union           0.1825  0.0172  10.6349  0.0000
table_re:
                     b      se        t    pval
C(year)[1980]        …       …   0.1546  0.8771
C(year)[1981]        …       …   0.3988  0.6901
C(year)[1982]        …       …   0.3211  0.7481
C(year)[1983]   0.0436  0.1780   0.2450  0.8065
C(year)[1984]   0.0664  0.1871   0.3551  0.7225
C(year)[1985]   0.0811  0.1961   0.4136  0.6792
C(year)[1986]   0.1152  0.2052   0.5617  0.5744
C(year)[1987]   0.1583  0.2143   0.7386  0.4602
educ            0.0919  0.0107   8.5744  0.0000
black          -0.1394  0.0480  -2.9054  0.0037
hisp            0.0217  0.0428   0.5078  0.6116
exper           0.1058  0.0154   6.8706  0.0000
I(exper ** 2)  -0.0047  0.0007  -6.8623  0.0000
married         0.0638  0.0168   3.8035  0.0001
union           0.1059  0.0179   5.9289  0.0000
table_fe:
                     b      se        t    pval
C(year)[1980]   1.4260  0.0183  77.7484  0.0000
C(year)[1981]   1.5772  0.0216  72.9656  0.0000
C(year)[1982]   1.6790  0.0265  63.2583  0.0000
C(year)[1983]   1.7805  0.0333  53.4392  0.0000
C(year)[1984]   1.9161  0.0417  45.9816  0.0000
C(year)[1985]   2.0435  0.0515  39.6460  0.0000
C(year)[1986]   2.1915  0.0630  34.7714  0.0000
C(year)[1987]   2.3510  0.0762  30.8669  0.0000
I(exper ** 2)  -0.0052  0.0007  -7.3612  0.0000
married         0.0467  0.0183   2.5494  0.0108
union           0.0800  0.0193   4.1430  0.0000

The RE estimator needs stronger assumptions to be consistent than the FE estimator. On the 
other hand, it is more efficient if these assumptions hold and we can include time-constant 
regressors. A widely used test of this additional assumption is the Hausman test. It is based on the 
comparison between the FE and RE parameter estimates. We include an example as Script 14.4 
(Example-HausmTest.py) in Appendix IV (p. 387), which uses the FE and RE estimates and 
implements a Hausman test as shown in Wooldridge (2010) (Section 10.7.3). The null hypothesis that 
the RE model is consistent is clearly rejected with sensible significance levels like $\alpha = 5\%$ or $\alpha = 1\%$. 
It also demonstrates that implementing a test on your own is a lot more cumbersome than relying 
completely on a module's routines. 
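
The core of such a test is a quadratic form in the difference between the two coefficient vectors. The following is a minimal sketch of the common textbook form of the statistic, not a reproduction of Script 14.4. The function name hausman_stat and all numbers are made up for illustration; only coefficients that appear in both models (the time-varying regressors) would be compared.

```python
import numpy as np


def hausman_stat(b_fe, b_re, V_fe, V_re):
    """H = d' (V_fe - V_re)^(-1) d with d = b_fe - b_re (hypothetical helper)."""
    d = np.asarray(b_fe, dtype=float) - np.asarray(b_re, dtype=float)
    V_diff = np.asarray(V_fe, dtype=float) - np.asarray(V_re, dtype=float)
    return float(d @ np.linalg.solve(V_diff, d))


# made-up estimates for two coefficients (not taken from WAGEPAN):
b_fe = [0.05, 0.08]
b_re = [0.06, 0.10]
V_fe = np.diag([0.0004, 0.0004])
V_re = np.diag([0.0003, 0.0003])

H = hausman_stat(b_fe, b_re, V_fe, V_re)
print(f'H: {H}\n')
```

Under the null hypothesis, the statistic is compared with a chi-squared distribution whose degrees of freedom equal the number of compared coefficients. In finite samples the difference of the two covariance matrices need not be positive definite, which is one reason why a ready-made routine is more convenient.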


14.3. Dummy Variable Regression and Correlated Random 
Effects 


It turns out that we can get the FE parameter estimates in two other ways than the within 
transformation we used in Section 14.1. The dummy variable regression uses OLS on the original variables 
in Equation 13.2 instead of the transformed ones. But it adds n − 1 dummy variables (or n dummies 
and removes the constant), one for each cross-sectional unit i = 1,...,n. The simplest (although 
not the computationally most efficient) way to implement this in Python is to use the cross-sectional 
index as another categorical variable. 

The third way to get the same results is the correlated random effects (CRE) approach. Instead of 
assuming that the individual effects $a_i$ are independent of the regressors $x_{itj}$, we assume that they 
only depend on the averages over time $\bar{x}_{ij} = \frac{1}{T} \sum_{t=1}^{T} x_{itj}$: 

$$
a_i = \gamma_0 + \gamma_1 \bar{x}_{i1} + \dots + \gamma_k \bar{x}_{ik} + r_i
\tag{14.3}
$$

$$
\begin{aligned}
y_{it} &= \beta_0 + \beta_1 x_{it1} + \dots + \beta_k x_{itk} + a_i + u_{it} \\
&= \beta_0 + \gamma_0 + \beta_1 x_{it1} + \dots + \beta_k x_{itk} + \gamma_1 \bar{x}_{i1} + \dots + \gamma_k \bar{x}_{ik} + r_i + u_{it}
\end{aligned}
\tag{14.4}
$$

If $r_i$ is uncorrelated with the regressors, we can consistently estimate the parameters of this model 
using the RE estimator. In addition to the original regressors, we include their averages over time. 
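
The equivalence of the three approaches can be checked on simulated data with plain numpy. The sketch below uses a made-up balanced panel with one regressor. As a simplifying assumption, the CRE-type equation is estimated by pooled OLS instead of RE; in a balanced panel this is enough to reproduce the FE slope.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 50, 4                        # hypothetical panel dimensions
ids = np.repeat(np.arange(n), T)    # unit index for each row

a = rng.normal(size=n)                        # individual effects a_i
x = a[ids] + rng.normal(size=n * T)           # regressor correlated with a_i
y = 1.0 + 0.5 * x + a[ids] + rng.normal(size=n * T)

# averages over time for each unit, expanded to one value per row:
x_bar = (np.bincount(ids, weights=x) / T)[ids]
y_bar = (np.bincount(ids, weights=y) / T)[ids]

# (1) within transformation: OLS of demeaned y on demeaned x
xd, yd = x - x_bar, y - y_bar
b_within = (xd @ yd) / (xd @ xd)

# (2) dummy variable regression: x plus one dummy per unit (no constant)
D = (ids[:, None] == np.arange(n)[None, :]).astype(float)
b_dum = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0][0]

# (3) CRE-type regression: constant, x, and the time average of x
X_cre = np.column_stack([np.ones(n * T), x, x_bar])
b_cre = np.linalg.lstsq(X_cre, y, rcond=None)[0][1]

print(f'b_within: {b_within}\nb_dum: {b_dum}\nb_cre: {b_cre}\n')
```

All three slope estimates agree up to numerical precision, although the regressor is correlated with the individual effect by construction.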

Script 14.5 (Example-Dummy-CRE-1.py) uses WAGEPAN again. We estimate the FE parameters 
using the within transformation (reg_we), the dummy variable approach (reg_dum), and the CRE 
approach (reg_cre). We also estimate the RE version of this model (reg_re). The results confirm 
that the first three methods deliver exactly the same parameter estimates, while the RE estimates 
differ. 

Script 14.5: Example-Dummy-CRE-1.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf
import linearmodels as plm

wagepan = woo.dataWoo('wagepan')
wagepan['t'] = wagepan['year']
wagepan['entity'] = wagepan['nr']
wagepan = wagepan.set_index(['nr'])

# include group specific means:
wagepan['married_b'] = wagepan.groupby('nr').mean()['married']
wagepan['union_b'] = wagepan.groupby('nr').mean()['union']
wagepan = wagepan.set_index(['year'], append=True)

# estimate FE parameters in 3 different ways:
reg_we = plm.PanelOLS.from_formula(
    formula='lwage ~ married + union + C(t)*educ + EntityEffects',
    drop_absorbed=True, data=wagepan)
results_we = reg_we.fit()

reg_dum = smf.ols(
    formula='lwage ~ married + union + C(t)*educ + C(entity)',
    data=wagepan)
results_dum = reg_dum.fit()

reg_cre = plm.RandomEffects.from_formula(
    formula='lwage ~ married + union + C(t)*educ + married_b + union_b',
    data=wagepan)
results_cre = reg_cre.fit()

# compare to RE estimate:
reg_re = plm.RandomEffects.from_formula(
    formula='lwage ~ married + union + C(t)*educ',
    data=wagepan)
results_re = reg_re.fit()

var_selection = ['married', 'union', 'C(t)[T.1982]:educ']

# print results:
table = pd.DataFrame({'b_we': round(results_we.params[var_selection], 4),
                      'b_dum': round(results_dum.params[var_selection], 4),
                      'b_cre': round(results_cre.params[var_selection], 4),
                      'b_re': round(results_re.params[var_selection], 4)})
print(f'table: \n{table}\n')


Output of Script 14.5: Example-Dummy-CRE-1.py

table:
                     b_we   b_dum   b_cre    b_re
married            0.0548  0.0548  0.0548  0.0773
union              0.0830  0.0830  0.0830  0.1075
C(t)[T.1982]:educ  0.0148  0.0148  0.0148  0.0143

Given we have estimated the CRE model, it is easy to test the null hypothesis that the RE 
estimator is consistent. The additional assumptions needed are $\gamma_1 = \dots = \gamma_k = 0$. They can 
easily be tested using an F test or the very similar Wald test as demonstrated in Script 14.6 
(Example-CRE-test-RE.py). As you see, linearmodels conveniently provides the routines 
for these tests. As with the Hausman test, we clearly reject the null hypothesis that the RE model is 
appropriate, with a tiny p value of about 0.0001. 
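
What such a Wald test computes can be sketched with numpy. The statistic for the hypothesis $R\beta = q$ is the quadratic form $(R\hat{\beta} - q)' (R \hat{V} R')^{-1} (R\hat{\beta} - q)$. The helper wald_stat and the numbers below are made up for illustration and are not the linearmodels implementation.

```python
import numpy as np


def wald_stat(b, V, R, q=None):
    """Wald statistic for H0: R b = q (q defaults to zero); hypothetical helper."""
    b = np.asarray(b, dtype=float)
    V = np.asarray(V, dtype=float)
    R = np.atleast_2d(np.asarray(R, dtype=float))
    q = np.zeros(R.shape[0]) if q is None else np.asarray(q, dtype=float)
    diff = R @ b - q
    return float(diff @ np.linalg.solve(R @ V @ R.T, diff))


# made-up example: three coefficients, test that the last two are zero
b = [0.5, 1.0, 2.0]
V = np.diag([0.25, 1.0, 1.0])
R = [[0, 1, 0],
     [0, 0, 1]]

W = wald_stat(b, V, R)
print(f'W: {W}\n')
```

Under the null hypothesis, the statistic is asymptotically chi-squared distributed with as many degrees of freedom as there are restrictions, here two.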


Script 14.6: Example-CRE-test-RE.py
import wooldridge as woo
import linearmodels as plm

wagepan = woo.dataWoo('wagepan')
wagepan['t'] = wagepan['year']
wagepan['entity'] = wagepan['nr']
wagepan = wagepan.set_index(['nr'])

# include group specific means:
wagepan['married_b'] = wagepan.groupby('nr').mean()['married']
wagepan['union_b'] = wagepan.groupby('nr').mean()['union']
wagepan = wagepan.set_index(['year'], append=True)

# estimate CRE:
reg_cre = plm.RandomEffects.from_formula(
    formula='lwage ~ married + union + C(t)*educ + married_b + union_b',
    data=wagepan)
results_cre = reg_cre.fit()

# RE test as a Wald test on the CRE specific coefficients:
wtest = results_cre.wald_test(formula='married_b = union_b = 0')
print(f'wtest: \n{wtest}\n')


Output of Script 14.6: Example-CRE-test-RE.py 
wtest: 
Linear Equality Hypothesis Test 
H0: Linear equality constraint is valid 
Statistic: 19.4058 
P-value: 0.0001 
Distributed: chi2(2) 


Another advantage of the CRE approach is that we can add time-constant regressors to the model. 
Since we cannot control for average values $\bar{x}_{ij}$ for these variables, they have to be uncorrelated with $a_i$ 
for consistent estimation of their coefficients. For the coefficients of the time-varying variables, 
we still don't need these additional RE assumptions. 

Script 14.7 (Example-CRE-2.py) estimates another version of the wage equation using the CRE 
approach. The variables married and union vary over time, so we can control for their between 
effects. The variables educ, black, and hisp do not vary. For a causal interpretation of their 
coefficients, we have to rely on uncorrelatedness with $a_i$. Given $a_i$ includes intelligence and other 
labor market success factors, this uncorrelatedness is more plausible for some variables (like gender 
or race) than for other variables (like education). 

Script 14.7: Example-CRE-2.py
import wooldridge as woo
import pandas as pd
import linearmodels as plm

wagepan = woo.dataWoo('wagepan')
wagepan['t'] = wagepan['year']
wagepan['entity'] = wagepan['nr']
wagepan = wagepan.set_index(['nr'])

# include group specific means:
wagepan['married_b'] = wagepan.groupby('nr').mean()['married']
wagepan['union_b'] = wagepan.groupby('nr').mean()['union']
wagepan = wagepan.set_index(['year'], append=True)

# estimate CRE parameters:
reg = plm.RandomEffects.from_formula(
    formula='lwage ~ married + union + educ +'
            'black + hisp + married_b + union_b',
    data=wagepan)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.std_errors, 4),
                      't': round(results.tstats, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Output of Script 14.7: Example-CRE-2.py

table:
                 b      se        t    pval
married     0.2417  0.0177  13.6772  0.0000
union       0.0700  0.0207   3.3804  0.0007
educ        0.1257  0.0023  55.4837  0.0000
black      -0.0892  0.0499  -1.7864  0.0741
hisp        0.0784  0.0426   1.8428  0.0654
married_b  -0.0436  0.0450  -0.9685  0.3329
union_b     0.2105  0.0519   4.0576  0.0001


14.4. Robust (Clustered) Standard Errors 


We argued above that under the RE assumptions, OLS is inefficient but consistent. Instead of using 
RE, we could simply use OLS but would have to adjust the standard errors for the fact that the 
composite error term $v_{it} = a_i + u_{it}$ is correlated over time because of the constant individual effect $a_i$. 
In fact, the variance-covariance matrix could be more complex than the RE assumption with i.i.d. $u_{it}$ 
implies. These error terms could be serially correlated and/or heteroscedastic. This would invalidate 
the standard errors not only of OLS but also of FD, FE, RE, and CRE. 

There is an elegant solution, especially in panels with a large cross-sectional dimension. Similar 
to standard errors that are robust with respect to heteroscedasticity in cross-sectional data (Section 
8.1) and serial correlation in time series (Section 12.3), there are formulas for the variance-covariance 
matrix for panel data that are robust with respect to heteroscedasticity and arbitrary correlations of 
the error term within a cross-sectional unit (or "cluster"). 

These "clustered" standard errors are mentioned in Wooldridge (2019, Section 14.4 and 
Example 13.9). Different versions of the clustered variance-covariance matrix can be computed in 
linearmodels. Script 14.8 (Example-13-9-ClSE.py) repeats the FD regression from Example 
13.9 and reports the adjusted standard errors. Similar to the heteroscedasticity-robust standard 
errors discussed in Section 8.1, there are different versions of formulas for clustered standard errors. 
We first use the default type (results_default), then a clustered type without (results_cluster) 
and with a small sample correction (results_css). The latter uses debiased=True (default) to 
adjust the degrees of freedom when estimating the covariance. 
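
The idea behind the clustered formula can be sketched with numpy. The function below implements the basic sandwich form without any small sample correction, applied to pooled OLS on simulated data. The name cluster_robust_cov and the data are made up for illustration; the code is not meant to reproduce the exact variants offered by linearmodels.

```python
import numpy as np


def cluster_robust_cov(X, resid, clusters):
    """Sandwich covariance (X'X)^-1 [sum_g X_g'u_g u_g'X_g] (X'X)^-1, no correction."""
    X = np.asarray(X, dtype=float)
    resid = np.asarray(resid, dtype=float)
    clusters = np.asarray(clusters)
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        mask = clusters == g
        score = X[mask].T @ resid[mask]   # sum of x_it * u_it within cluster g
        meat += np.outer(score, score)
    return bread @ meat @ bread


# simulated panel with composite error v_it = a_i + u_it (made-up numbers)
rng = np.random.default_rng(1)
n, T = 100, 5
ids = np.repeat(np.arange(n), T)
x = rng.normal(size=n * T)
v = rng.normal(size=n)[ids] + rng.normal(size=n * T)
y = 1.0 + 0.5 * x + v

# pooled OLS:
X = np.column_stack([np.ones(n * T), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b

# conventional vs. clustered standard errors:
se_ols = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * (resid @ resid) / (n * T - 2))
V_cl = cluster_robust_cov(X, resid, ids)
se_cl = np.sqrt(np.diag(V_cl))
print(f'se_ols: {se_ols}\nse_cl: {se_cl}\n')
```

If every observation forms its own cluster, the formula reduces to the heteroscedasticity-robust (White) covariance matrix without degrees of freedom adjustment.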


Script 14.8: Example-13-9-ClSE.py
import wooldridge as woo
import numpy as np
import pandas as pd
import linearmodels as plm

crime4 = woo.dataWoo('crime4')
crime4 = crime4.set_index(['county', 'year'], drop=False)

# estimate FD model:
reg = plm.FirstDifferenceOLS.from_formula(
    formula='np.log(crmrte) ~ year + d83 + d84 + d85 + d86 + d87 +'
            'lprbarr + lprbconv + lprbpris + lavgsen + lpolpc',
    data=crime4)

# regression with standard SE:
results_default = reg.fit()

# regression with "clustered" SE:
results_cluster = reg.fit(cov_type='clustered', cluster_entity=True,
                          debiased=False)

# regression with "clustered" SE (small-sample correction):
results_css = reg.fit(cov_type='clustered', cluster_entity=True)

# print results:
table = pd.DataFrame({'b': round(results_default.params, 4),
                      'se_default': round(results_default.std_errors, 4),
                      'se_cluster': round(results_cluster.std_errors, 4),
                      'se_css': round(results_css.std_errors, 4)})
print(f'table: \n{table}\n')


Output of Script 14.8: Example-13-9-ClSE.py

table:
                b  se_default  se_cluster  se_css
year       0.0077      0.0171      0.0136  0.0137
d83       -0.0999      0.0239      0.0219  0.0222
d84       -0.1478      0.0413      0.0356  0.0359
d85       -0.1524      0.0584      0.0505  0.0511
d86       -0.1249      0.0760      0.0624  0.0630
d87       -0.0841      0.0940      0.0773  0.0781
lprbarr   -0.3275      0.0300      0.0556  0.0562
lprbconv  -0.2381      0.0182      0.0390  0.0394
lprbpris  -0.1650      0.0260      0.0451  0.0456
lavgsen   -0.0218      0.0221      0.0254  0.0257
lpolpc     0.3984      0.0269      0.1014  0.1025


15. Instrumental Variables Estimation and 
Two Stage Least Squares 


Instrumental variables are potentially powerful tools for the identification and estimation of causal 
effects. We start the discussion in Section 15.1 with the simplest case of one endogenous regressor 
and one instrumental variable. Section 15.2 shows how to implement models with additional exoge- 
nous regressors. In Section 15.3, we will introduce two stage least squares which efficiently deals 
with several endogenous variables and several instruments. 

Tests of the exogeneity of the regressors and instruments are presented in Sections 15.4 and 15.5, 
respectively. Finally, Section 15.6 shows how to conveniently combine panel data estimators with 
instrumental variables. 


15.1. Instrumental Variables in Simple Regression Models 


We start the discussion of instrumental variables (IV) regression with the most straightforward case 
of only one regressor and only one instrumental variable. Consider the simple linear regression 
model for cross-sectional data 

$$
y = \beta_0 + \beta_1 x + u.
\tag{15.1}
$$

The OLS estimator for the slope parameter is $\hat{\beta}_1^{OLS} = \frac{Cov(x, y)}{Var(x)}$, see Equation 2.3. Suppose the regressor 
x is correlated with the error term u, so OLS parameter estimators will be biased and inconsistent. 
If we have a valid instrumental variable z, we can consistently estimate $\beta_1$ using the IV estimator 

$$
\hat{\beta}_1^{IV} = \frac{Cov(z, y)}{Cov(z, x)}.
\tag{15.2}
$$


A valid instrument is correlated with the regressor x (“relevant”), so the denominator of Equation 
15.2 is nonzero. It is also uncorrelated with the error term u (“exogenous”). Wooldridge (2019, 
Section 15.1) provides more discussion and examples. 
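
A small simulation illustrates why the IV estimator helps. In the sketch below, all variables and parameter values are made up: the regressor x shares a shock e with the error term u, while the instrument z affects x but is independent of u. The true slope is 0.5.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

z = rng.normal(size=n)               # instrument: relevant and exogenous
e = rng.normal(size=n)               # common shock that makes x endogenous
x = z + e + rng.normal(size=n)
u = e + rng.normal(size=n)
y = 1.0 + 0.5 * x + u                # true slope: 0.5

# OLS slope: Cov(x, y) / Var(x)
b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# IV slope: Cov(z, y) / Cov(z, x)
b_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f'b_ols: {b_ols}\n')
print(f'b_iv: {b_iv}\n')
```

In this design the OLS estimate settles well above the true value because Cov(x, u) > 0, whereas the IV estimate is close to 0.5 in large samples.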

To implement IV regression in Python, the module linearmodels offers the command IV2SLS 
including the convenient formula syntax we know from statsmodels. When working with IV 
regression in linearmodels, our first line of code always is: 

import linearmodels.iv as iv 

In the formula specification, the endogenous regressor(s) x_end and instruments z are provided 
in the following way: 

y ~ 1 + [x_end ~ z] 

Note that we can easily consider different assumptions about the error term by providing the 
argument cov_type to the fit method. If you use cov_type='unadjusted', error terms are assumed 
to be homoskedastic. In combination with debiased=True, this is the right option if you want to 
replicate results in Wooldridge (2019). The argument cov_type='robust' is the default and 
implements a heteroscedasticity-robust estimation. Also remember that constants in linearmodels must be explicitly 
included by adding "1" to the formula. For other options, see the module documentation. 


Wooldridge, Example 15.1: Return to Education for Married Women 


Script 15.1 (Example-15-1.py) uses data from MROZ. We only analyze women with non-missing wage, 
so we use the method dropna to extract them. We want to estimate the return to education (educ) 
for these women. As an instrumental variable for education, we use the education of her father 
(fatheduc). 

First, we calculate the OLS and IV slope parameters according to Equations 2.3 and 15.2. Then, the 
full OLS and IV estimates are calculated using the built-in routines ols and IV2SLS, respectively. Not 
surprisingly, the slope parameters match the manual results. 


Script 15.1: Example-15-1.py
import wooldridge as woo
import numpy as np
import pandas as pd
import linearmodels.iv as iv
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# restrict to non-missing wage observations:
mroz = mroz.dropna(subset=['lwage'])

cov_yz = np.cov(mroz['lwage'], mroz['fatheduc'])[1, 0]
cov_xy = np.cov(mroz['educ'], mroz['lwage'])[1, 0]
cov_xz = np.cov(mroz['educ'], mroz['fatheduc'])[1, 0]
var_x = np.var(mroz['educ'], ddof=1)
x_bar = np.mean(mroz['educ'])
y_bar = np.mean(mroz['lwage'])

# OLS slope parameter manually:
b_ols_man = cov_xy / var_x
print(f'b_ols_man: {b_ols_man}\n')

# IV slope parameter manually:
b_iv_man = cov_yz / cov_xz
print(f'b_iv_man: {b_iv_man}\n')

# OLS automatically:
reg_ols = smf.ols(formula='np.log(wage) ~ educ', data=mroz)
results_ols = reg_ols.fit()

# print regression table:
table_ols = pd.DataFrame({'b': round(results_ols.params, 4),
                          'se': round(results_ols.bse, 4),
                          't': round(results_ols.tvalues, 4),
                          'pval': round(results_ols.pvalues, 4)})
print(f'table_ols: \n{table_ols}\n')

# IV automatically:
reg_iv = iv.IV2SLS.from_formula(formula='np.log(wage) ~ 1 + [educ ~ fatheduc]',
                                data=mroz)
results_iv = reg_iv.fit(cov_type='unadjusted', debiased=True)

# print regression table:
table_iv = pd.DataFrame({'b': round(results_iv.params, 4),
                         'se': round(results_iv.std_errors, 4),
                         't': round(results_iv.tstats, 4),
                         'pval': round(results_iv.pvalues, 4)})
print(f'table_iv: \n{table_iv}\n')


Output of Script 15.1: Example-15-1.py

b_ols_man: 0.10864865517467534

b_iv_man: 0.05917347999936601

table_ols:
                b      se       t   pval
Intercept -0.1852  0.1852 -0.9998  0.318
educ       0.1086  0.0144  7.5451  0.000
table_iv:
                b      se       t    pval
Intercept  0.4411  0.4461  0.9888  0.3233
educ       0.0592  0.0351  1.6839  0.0929


15.2. More Exogenous Regressors 


The IV approach can easily be generalized to include additional exogenous regressors, i.e. regressors 
that are assumed to be unrelated to the error term. In the formula specification of IV2SLS, the 
exogenous regressor(s) x_exg, the endogenous regressor(s) x_end and instruments z are provided 
in the following way: 

y ~ 1 + x_exg + [x_end ~ z] 


Wooldridge, Example 15.4: Using College Proximity as an IV for Education 


In Script 15.2 (Example-15-4.py), we use CARD to estimate the return to education. Education is 
allowed to be endogenous and instrumented with the dummy variable nearc4 which indicates whether 
the individual grew up close to a college. In addition, we control for experience, race, and regional 
information. These variables are assumed to be exogenous and act as their own instruments. 

We first check for relevance by regressing the endogenous independent variable educ on all exogenous 
variables including the instrument nearc4. Its parameter is highly significantly different from zero, so 
relevance is supported. We then estimate the log wage equation with OLS and IV. 

Script 15.2: Example-15-4.py
import wooldridge as woo
import numpy as np
import pandas as pd
import linearmodels.iv as iv
import statsmodels.formula.api as smf

card = woo.dataWoo('card')

# checking for relevance with reduced form:
reg_redf = smf.ols(
    formula='educ ~ nearc4 + exper + I(exper**2) + black + smsa +'
            'south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 +'
            'reg667 + reg668 + reg669', data=card)
results_redf = reg_redf.fit()

# print regression table:
table_redf = pd.DataFrame({'b': round(results_redf.params, 4),
                           'se': round(results_redf.bse, 4),
                           't': round(results_redf.tvalues, 4),
                           'pval': round(results_redf.pvalues, 4)})
print(f'table_redf: \n{table_redf}\n')

# OLS:
reg_ols = smf.ols(
    formula='np.log(wage) ~ educ + exper + I(exper**2) + black + smsa +'
            'south + smsa66 + reg662 + reg663 + reg664 + reg665 +'
            'reg666 + reg667 + reg668 + reg669', data=card)
results_ols = reg_ols.fit()

# print regression table:
table_ols = pd.DataFrame({'b': round(results_ols.params, 4),
                          'se': round(results_ols.bse, 4),
                          't': round(results_ols.tvalues, 4),
                          'pval': round(results_ols.pvalues, 4)})
print(f'table_ols: \n{table_ols}\n')

# IV automatically:
reg_iv = iv.IV2SLS.from_formula(
    formula='np.log(wage) ~ 1 + exper + I(exper**2) + black + smsa + '
            'south + smsa66 + reg662 + reg663 + reg664 + reg665 +'
            'reg666 + reg667 + reg668 + reg669 + [educ ~ nearc4]',
    data=card)
results_iv = reg_iv.fit(cov_type='unadjusted', debiased=True)

# print regression table:
table_iv = pd.DataFrame({'b': round(results_iv.params, 4),
                         'se': round(results_iv.std_errors, 4),
                         't': round(results_iv.tstats, 4),
                         'pval': round(results_iv.pvalues, 4)})

print(f'table_iv: \n{table_iv}\n')


Output of Script 15.2: Example-15-4.py

table_redf:
                     b      se         t    pval
Intercept      16.6383  0.2406   69.1446  0.0000
nearc4          0.3199  0.0879    3.6408  0.0003
exper          -0.4125  0.0337  -12.2415  0.0000
I(exper ** 2)   0.0009  0.0017    0.5263  0.5987
black          -0.9355  0.0937   -9.9806  0.0000
smsa            0.4022  0.1048    3.8372  0.0001
south          -0.0516  0.1354   -0.3811  0.7032
smsa66          0.0255  0.1058    0.2409  0.8096
reg662         -0.0786  0.1871   -0.4203  0.6743
reg663         -0.0279  0.1834   -0.1524  0.8789
reg664          0.1172  0.2173    0.5394  0.5897
reg665         -0.2726  0.2184   -1.2481  0.2121
reg666         -0.3028  0.2371   -1.2773  0.2016
reg667         -0.2168  0.2344   -0.9250  0.3550
reg668          0.5239  0.2675    1.9587  0.0502
reg669          0.2103  0.2025    1.0386  0.2991

table_ols:
                     b      se         t    pval
Intercept       4.6208  0.0742   62.2476  0.0000
educ            0.0747  0.0035   21.3510  0.0000
exper           0.0848  0.0066   12.8063  0.0000
I(exper ** 2)  -0.0023  0.0003   -7.2232  0.0000
black          -0.1990  0.0182  -10.9058  0.0000
smsa            0.1364  0.0201    6.7851  0.0000
south          -0.1480  0.0260   -5.6950  0.0000
smsa66          0.0262  0.0194    1.3493  0.1773
reg662          0.0964  0.0359    2.6845  0.0073
reg663          0.1445  0.0351    4.1151  0.0000
reg664          0.0551  0.0417    1.3221  0.1862
reg665          0.1280  0.0418    3.0599  0.0022
reg666          0.1405  0.0452    3.1056  0.0019
reg667          0.1180  0.0448    2.6334  0.0085
reg668         -0.0564  0.0513   -1.1010  0.2710
reg669          0.1186  0.0388    3.0536  0.0023

table_iv:
                     b      se         t    pval
Intercept       3.6662  0.9248    3.9641  0.0001
exper           0.1083  0.0237    4.5764  0.0000
I(exper ** 2)  -0.0023  0.0003   -7.0014  0.0000
black          -0.1468  0.0539   -2.7231  0.0065
smsa            0.1118  0.0317    3.5313  0.0004
south          -0.1447  0.0273   -5.3023  0.0000
smsa66          0.0185  0.0216    0.8576  0.3912
reg662          0.1008  0.0377    2.6739  0.0075
reg663          0.1483  0.0368    4.0272  0.0001
reg664          0.0499  0.0437    1.1408  0.2541
reg665          0.1463  0.0471    3.1079  0.0019
reg666          0.1629  0.0519    3.1382  0.0017
reg667          0.1346  0.0494    2.7240  0.0065
reg668         -0.0831  0.0593   -1.4002  0.1616
reg669          0.1078  0.0418    2.5784  0.0100
educ            0.1315  0.0550    2.3926  0.0168


15.3. Two Stage Least Squares


Two stage least squares (2SLS) is a general approach for IV estimation when we have one or more 
endogenous regressors and at least as many additional instrumental variables. Consider the regression 
model 

$$
y_1 = \beta_0 + \beta_1 y_2 + \beta_2 y_3 + \beta_3 z_1 + \beta_4 z_2 + \beta_5 z_3 + u_1.
\tag{15.3}
$$

The regressors $y_2$ and $y_3$ are potentially correlated with the error term $u_1$; the regressors $z_1$, $z_2$, and 
$z_3$ are assumed to be exogenous. Because we have two endogenous regressors, we need at least two 
additional instrumental variables, say $z_4$ and $z_5$. 

The name of 2SLS comes from the fact that it can be performed in two stages of OLS regressions: 

(1) Separately regress $y_2$ and $y_3$ on $z_1$ through $z_5$. Obtain fitted values $\hat{y}_2$ and $\hat{y}_3$. 

(2) Regress $y_1$ on $\hat{y}_2$, $\hat{y}_3$, and $z_1$ through $z_3$. 

If the instruments are valid, this will give consistent estimates of the parameters $\beta_0$ through $\beta_5$. 
Generalizing this to more endogenous regressors and instrumental variables is straightforward. 

This procedure can of course easily be implemented using ols in statsmodels, remembering 
that fitted values are saved in fittedvalues. One of the problems of this manual approach is 
that the resulting variance-covariance matrix and analyses based on it are invalid. Conveniently, 
IV2SLS will automatically do these calculations and calculate correct standard errors and the like. 
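
The two stages and the problem with the naive standard errors can be sketched with numpy on simulated data. The example below is made up and uses only one endogenous regressor (y2), one exogenous regressor (z1), and two extra instruments (z2, z3). The corrected standard errors use the homoskedastic textbook formula with residuals based on the original regressor instead of its fitted values.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000

z1 = rng.normal(size=n)                                  # exogenous regressor
z2, z3 = rng.normal(size=n), rng.normal(size=n)          # additional instruments
e = rng.normal(size=n)                                   # source of endogeneity
y2 = 0.5 * z1 + z2 + 0.5 * z3 + e + rng.normal(size=n)   # endogenous regressor
u = e + rng.normal(size=n)
y1 = 1.0 + 0.5 * y2 + 0.3 * z1 + u                       # true coefficient on y2: 0.5

const = np.ones(n)
Z = np.column_stack([const, z1, z2, z3])   # all exogenous variables
X = np.column_stack([const, y2, z1])       # regressors of the structural equation

# stage 1: fitted values of the endogenous regressor
y2_hat = Z @ np.linalg.lstsq(Z, y2, rcond=None)[0]

# stage 2: replace y2 by its fitted values
X_hat = np.column_stack([const, y2_hat, z1])
b_2sls = np.linalg.lstsq(X_hat, y1, rcond=None)[0]

# standard errors: naive second-stage residuals vs. residuals based on X
k = X.shape[1]
XhXh_inv = np.linalg.inv(X_hat.T @ X_hat)
resid_naive = y1 - X_hat @ b_2sls
resid_2sls = y1 - X @ b_2sls
se_naive = np.sqrt(np.diag(XhXh_inv) * (resid_naive @ resid_naive) / (n - k))
se_2sls = np.sqrt(np.diag(XhXh_inv) * (resid_2sls @ resid_2sls) / (n - k))

print(f'b_2sls: {b_2sls}\nse_naive: {se_naive}\nse_2sls: {se_2sls}\n')
```

The coefficient estimates from the manual second stage are the 2SLS estimates, but the naive standard errors are based on the wrong residuals and therefore differ from the corrected ones.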


Wooldridge, Example 15.5: Return to Education for Married Women 


We continue Example 15.1 and still want to estimate the return to education for women using the data 
in MROZ. Now, we use both mother's and father's education as instruments for their own education. 

In Script 15.3 (Example-15-5.py), we obtain 2SLS estimates in two ways: First, we do both stages 
manually, including fitted education as educ_fitted as a regressor in the second stage. IV2SLS does this 
automatically and delivers the same parameter estimates as the output table reveals. But the standard 
errors differ slightly because the manual two stage version did not correct them. 


Script 15.3: Example-15-5.py 
import wooldridge as woo
import numpy as np
import pandas as pd
import linearmodels.iv as iv
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# restrict to non-missing wage observations:
mroz = mroz.dropna(subset=['lwage'])

# 1st stage (reduced form):
reg_redf = smf.ols(formula='educ ~ exper + I(exper**2) + motheduc + fatheduc',
                   data=mroz)
results_redf = reg_redf.fit()
mroz['educ_fitted'] = results_redf.fittedvalues

# print regression table:
table_redf = pd.DataFrame({'b': round(results_redf.params, 4),
                           'se': round(results_redf.bse, 4),
                           't': round(results_redf.tvalues, 4),
                           'pval': round(results_redf.pvalues, 4)})
print(f'table_redf: \n{table_redf}\n')

# 2nd stage:
reg_secstg = smf.ols(formula='np.log(wage) ~ educ_fitted + exper + I(exper**2)',
                     data=mroz)
results_secstg = reg_secstg.fit()

# print regression table:
table_secstg = pd.DataFrame({'b': round(results_secstg.params, 4),
                             'se': round(results_secstg.bse, 4),
                             't': round(results_secstg.tvalues, 4),
                             'pval': round(results_secstg.pvalues, 4)})
print(f'table_secstg: \n{table_secstg}\n')

# IV automatically:
reg_iv = iv.IV2SLS.from_formula(
    formula='np.log(wage) ~ 1 + exper + I(exper**2) +'
            '[educ ~ motheduc + fatheduc]',
    data=mroz)
results_iv = reg_iv.fit(cov_type='unadjusted', debiased=True)

# print regression table:
table_iv = pd.DataFrame({'b': round(results_iv.params, 4),
                         'se': round(results_iv.std_errors, 4),
                         't': round(results_iv.tstats, 4),
                         'pval': round(results_iv.pvalues, 4)})
print(f'table_iv: \n{table_iv}\n')


Output of Script 15.3: Example-15-5.py 


table_redf: 
                    b      se        t    pval
Intercept      9.1026  0.4266  21.3396  0.0000
exper          0.0452  0.0403   1.1236  0.2618
I(exper ** 2) -0.0010  0.0012  -0.8386  0.4022
motheduc       0.1576  0.0359   4.3906  0.0000
fatheduc       0.1895  0.0338   5.6152  0.0000

table_secstg: 
                    b      se       t    pval
Intercept      0.0481  0.4198  0.1146  0.9088
educ_fitted    0.0614  0.0330  1.8626  0.0632
exper          0.0442  0.0141  3.1361  0.0018
I(exper ** 2) -0.0009  0.0004 -2.1344  0.0334

table_iv: 
                    b      se       t    pval
Intercept      0.0481  0.4003  0.1202  0.9044
exper          0.0442  0.0134  3.2883  0.0011
I(exper ** 2) -0.0009  0.0004 -2.2380  0.0257
educ           0.0614  0.0314  1.9530  0.0515


15.4. Testing for Exogeneity of the Regressors 


There is another way to get the same IV parameter estimates as with 2SLS. In the same setup as 
above, this “control function approach” also consists of two stages: 

(1) As in 2SLS, regress $y_2$ and $y_3$ on $z_1$ through $z_5$. Obtain the residuals $\hat{v}_2$ and 
$\hat{v}_3$ instead of the fitted values $\hat{y}_2$ and $\hat{y}_3$. 

(2) Regress $y_1$ on $y_2$, $y_3$, $z_1$, $z_2$, $z_3$, and the first stage residuals $\hat{v}_2$ and $\hat{v}_3$. 

This approach is as simple to implement as 2SLS. It results in the same parameter estimates and, 
like manual 2SLS, in invalid OLS standard errors in the second stage (unless the dubious regressors 
$y_2$ and $y_3$ are in fact exogenous). 

After this second stage regression, we can test for exogeneity in a simple way, assuming the 
instruments are valid. We just need to do a t or F test of the null hypothesis that the parameters of 
the first-stage residuals are equal to zero. If we reject this hypothesis, this indicates endogeneity 
of $y_2$ or $y_3$ (or both). 
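
The equivalence with 2SLS and the exogeneity test can be sketched with numpy on simulated data. This 
is an illustration only: the data generating process, the coefficient values, and the helper ols are 
assumptions made for this sketch. It uses one endogenous regressor, so the test reduces to a t test 
on the coefficient of the single first-stage residual. 

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# simulated data (arbitrary illustrative values):
z1, z2, z3, v, e = rng.normal(size=(5, n))
u = 0.8 * v + e                     # y2 is endogenous because u depends on v
y2 = 0.5 * z1 + z2 + 0.5 * z3 + v
y1 = 1.0 + 2.0 * y2 - 1.0 * z1 + u

const = np.ones(n)
Z = np.column_stack([const, z1, z2, z3])   # all exogenous variables
X = np.column_stack([const, y2, z1])       # structural regressors


def ols(X, y):
    """OLS coefficients via least squares (hypothetical helper)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


# stage 1: first-stage residuals instead of fitted values
v_hat = y2 - Z @ ols(Z, y2)

# stage 2: original regressors plus the first-stage residuals
X_cf = np.column_stack([X, v_hat])
b_cf = ols(X_cf, y1)

# 2SLS for comparison
X_hat = Z @ ols(Z, X)
b_2sls = np.linalg.solve(X_hat.T @ X, X_hat.T @ y1)

# usual OLS t statistic for the coefficient of v_hat (exogeneity test)
resid = y1 - X_cf @ b_cf
sigma2 = resid @ resid / (n - X_cf.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X_cf.T @ X_cf)))
t_vhat = b_cf[-1] / se[-1]

print(f'b_cf (without v_hat): {b_cf[:3]}')
print(f'b_2sls: {b_2sls}')
print(f't_vhat: {t_vhat}')
```

Because the simulated regressor is strongly endogenous by construction, the t statistic is far 
from zero here. 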


Wooldridge, Example 15.7: Return to Education for Married Women 


In Script 15.4 (Example-15-7.py), we continue Example 15.5 using the control function approach. 
Again, we use both mother's and father's education as instruments. The first stage regression is 
identical to that in Script 15.3 (Example-15-5.py). The second stage adds the first stage residuals to 
the original list of regressors. The parameter estimates are identical to both the manual 2SLS and the 
automatic IV2SLS results. We can perform a t test based on the regression table as a test for exogeneity. 
Here, t = 0.0582/0.0348 ≈ 1.67 with a two-sided p value of p = 0.095, indicating marginally significant 
evidence of endogeneity. 


Script 15.4: Example-15-7.py 
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# restrict to non-missing wage observations:
mroz = mroz.dropna(subset=['lwage'])

# 1st stage (reduced form):
reg_redf = smf.ols(formula='educ ~ exper + I(exper**2) + motheduc + fatheduc',
                   data=mroz)
results_redf = reg_redf.fit()
mroz['resid'] = results_redf.resid

# 2nd stage:
reg_secstg = smf.ols(formula='np.log(wage) ~ resid + educ + exper + I(exper**2)',
                     data=mroz)
results_secstg = reg_secstg.fit()

# print regression table:
table_secstg = pd.DataFrame({'b': round(results_secstg.params, 4),
                             'se': round(results_secstg.bse, 4),
                             't': round(results_secstg.tvalues, 4),
                             'pval': round(results_secstg.pvalues, 4)})
print(f'table_secstg: \n{table_secstg}\n')


Output of Script 15.4: Example-15-7.py 


table_secstg: 
                    b      se       t    pval
Intercept      0.0481  0.3946  0.1219  0.9030
resid          0.0582  0.0348  1.6711  0.0954
educ           0.0614  0.0310  1.9815  0.0482
exper          0.0442  0.0132  3.3363  0.0009
I(exper ** 2) -0.0009  0.0004 -2.2706  0.0237


15.5. Testing Overidentifying Restrictions 


If we have more instruments than endogenous variables, we can use either all or only some of them. 
If all are valid, using all improves the accuracy of the 2SLS estimator and reduces its standard errors. 
If the exogeneity of some is dubious, including them might cause inconsistency. It is therefore useful 
to test for the exogeneity of a set of dubious instruments if we have another (large enough) set that 
is undoubtedly exogenous. The procedure is described by Wooldridge (2019, Section 15.5): 

(1) Estimate the model by 2SLS and obtain residuals $\hat{u}_1$. 

(2) Regress $\hat{u}_1$ on all exogenous variables and calculate $R_1^2$. 

(3) The test statistic $nR_1^2$ is asymptotically distributed as $\chi^2_q$, where $q$ is the number of 
overidentifying restrictions, i.e. number of instruments minus number of endogenous regressors. 
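
Step (3) is simple arithmetic once n and the auxiliary R-squared are known. The following sketch plugs 
in the values of n and r2 that are reported in the output of Script 15.5 for Example 15.8, where q = 1. 
As an assumption of this sketch, the p value is computed with the standard library only, using the fact 
that the survival function of the chi-squared distribution with one degree of freedom is 
erfc(sqrt(x/2)). This shortcut does not carry over to other values of q. 

```python
import math

# values as reported in the output of Script 15.5:
n = 428
r2 = 0.0008833444088017783

# test statistic n * R^2:
teststat = n * r2

# p value for q = 1 only: P(chi2_1 > x) = erfc(sqrt(x / 2))
pval = math.erfc(math.sqrt(teststat / 2))

print(f'teststat: {teststat}')
print(f'pval: {pval}')
```

Both numbers agree with the values printed by the script (about 0.378 and 0.539). 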


Wooldridge, Example 15.8: Return to Education for Married Women 


We will again use the data and model of Examples 15.5 and 15.7. Script 15.5 (Example-15-8.py) 
estimates the model using IV2SLS. The results are stored in variable results_iv. We then run the 
auxiliary regression and compute its $R^2$ as r2. The test statistic teststat is computed to be 0.378. We 
also compute the p value from the $\chi^2_1$ distribution. We cannot reject exogeneity of the instruments 
using this test. But be aware of the fact that the underlying assumption that at least one instrument is 
valid might be violated here. 


Script 15.5: Example-15-8.py 
import wooldridge as woo
import numpy as np
import pandas as pd
import linearmodels.iv as iv
import statsmodels.formula.api as smf
import scipy.stats as stats

mroz = woo.dataWoo('mroz')

# restrict to non-missing wage observations:
mroz = mroz.dropna(subset=['lwage'])

# IV regression:
reg_iv = iv.IV2SLS.from_formula(formula='np.log(wage) ~ 1 + exper + I(exper**2) +'
                                        '[educ ~ motheduc + fatheduc]', data=mroz)
results_iv = reg_iv.fit(cov_type='unadjusted', debiased=True)

# print regression table:
table_iv = pd.DataFrame({'b': round(results_iv.params, 4),
                         'se': round(results_iv.std_errors, 4),
                         't': round(results_iv.tstats, 4),
                         'pval': round(results_iv.pvalues, 4)})
print(f'table_iv: \n{table_iv}\n')

# auxiliary regression:
mroz['resid_iv'] = results_iv.resids
reg_aux = smf.ols(formula='resid_iv ~ exper + I(exper**2) + motheduc + fatheduc',
                  data=mroz)
results_aux = reg_aux.fit()

# calculations for test:
r2 = results_aux.rsquared
n = results_aux.nobs
teststat = n * r2
pval = 1 - stats.chi2.cdf(teststat, 1)

print(f'r2: {r2}\n')
print(f'n: {n}\n')
print(f'teststat: {teststat}\n')
print(f'pval: {pval}\n')


Output of Script 15.5: Example-15-8.py 


table_iv: 
                    b      se       t    pval
Intercept      0.0481  0.4003  0.1202  0.9044
exper          0.0442  0.0134  3.2883  0.0011
I(exper ** 2) -0.0009  0.0004 -2.2380  0.0257
educ           0.0614  0.0314  1.9530  0.0515

r2: 0.0008833444088017783

n: 428.0

teststat: 0.3780714069671611

pval: 0.5386371981605377


15.6. Instrumental Variables with Panel Data 


Instrumental variables can be used for panel data, too. In this way, we can get rid of time-constant 
individual heterogeneity by first differencing or within transformations and then fix remaining 
endogeneity problems with instrumental variables. 

We know how to get panel data estimates using OLS on the transformed data, so we can easily 
apply IV to the transformed data in the same way as before. 
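
The logic can be sketched with a small simulation that uses only the standard library. Everything in 
it is an assumption made for illustration: a two-period panel, a time-constant effect a, a time-varying 
confounder e, an instrument z, and a hypothetical true effect beta. First differencing removes a, but 
OLS on the differences is still biased because of e, whereas the simple IV estimator on the differences 
is close to beta. 

```python
import random

rng = random.Random(42)
n = 5000        # number of units, each observed in two periods
beta = -0.5     # hypothetical true effect of x on y

dy, dx, dz = [], [], []
for _ in range(n):
    a = rng.gauss(0, 1)                         # time-constant heterogeneity
    z = [rng.gauss(0, 1), rng.gauss(0, 1)]      # instrument in both periods
    e = [rng.gauss(0, 1), rng.gauss(0, 1)]      # time-varying confounder
    w = [rng.gauss(0, 1), rng.gauss(0, 1)]      # idiosyncratic noise
    x = [z[t] + a + e[t] for t in (0, 1)]
    y = [beta * x[t] + a + e[t] + w[t] for t in (0, 1)]
    # first differences remove the time-constant effect a:
    dy.append(y[1] - y[0])
    dx.append(x[1] - x[0])
    dz.append(z[1] - z[0])


def cov(p, q):
    """Sample covariance of two equally long lists."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    return sum((pi - mp) * (qi - mq) for pi, qi in zip(p, q)) / (len(p) - 1)


# OLS and simple IV slope estimates on the differenced data:
b_fd_ols = cov(dx, dy) / cov(dx, dx)
b_fd_iv = cov(dz, dy) / cov(dz, dx)

print(f'b_fd_ols: {b_fd_ols}')
print(f'b_fd_iv: {b_fd_iv}')
```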


Wooldridge, Example 15.10: Job Training and Worker Productivity 


We use the data set JTRAIN to estimate the effect of job training hrsemp on the scrap rate. In Script 
15.6 (Example-15-10.py), we load the data, choose a subset of the years 1987 and 1988 with loc and 
store the data with correct index variables fcode and year, see Section 13.3. Then we estimate the 
parameters using first-differencing with the instrumental variable grant. 


Script 15.6: Example-15-10.py 
import wooldridge as woo
import pandas as pd
import linearmodels.iv as iv

jtrain = woo.dataWoo('jtrain')

# define panel data (for 1987 and 1988 only):
jtrain_87_88 = jtrain.loc[(jtrain['year'] == 1987) | (jtrain['year'] == 1988), :]
jtrain_87_88 = jtrain_87_88.set_index(['fcode', 'year'])

# manual computation of first differences within each firm:
jtrain_87_88['lscrap_diff1'] = \
    jtrain_87_88.sort_values(['fcode', 'year']).groupby('fcode')['lscrap'].diff()
jtrain_87_88['hrsemp_diff1'] = \
    jtrain_87_88.sort_values(['fcode', 'year']).groupby('fcode')['hrsemp'].diff()
jtrain_87_88['grant_diff1'] = \
    jtrain_87_88.sort_values(['fcode', 'year']).groupby('fcode')['grant'].diff()

# IV regression:
reg_iv = iv.IV2SLS.from_formula(
    formula='lscrap_diff1 ~ 1 + [hrsemp_diff1 ~ grant_diff1]',
    data=jtrain_87_88)
results_iv = reg_iv.fit(cov_type='unadjusted', debiased=True)

# print regression table:
table_iv = pd.DataFrame({'b': round(results_iv.params, 4),
                         'se': round(results_iv.std_errors, 4),
                         't': round(results_iv.tstats, 4),
                         'pval': round(results_iv.pvalues, 4)})
print(f'table_iv: \n{table_iv}\n')


Output of Script 15.6: Example-15-10.py 
table_iv: 
                   b      se       t    pval
Intercept    -0.0327  0.1270 -0.2573  0.7982
hrsemp_diff1 -0.0142  0.0079 -1.7882  0.0808


16. Simultaneous Equations Models 


In simultaneous equations models (SEM), both the dependent variable and at least one regressor are 
determined jointly. This leads to an endogeneity problem and inconsistent OLS parameter estimators. 
The main challenge for successfully using SEM is to specify a sensible model and make sure 
it is identified, see Wooldridge (2019, Sections 16.1-16.3). We briefly introduce a general model and 
the notation in Section 16.1. 

As discussed in Chapter 15, 2SLS regression can solve endogeneity problems if there are enough 
exogenous instrumental variables. This also works in the setting of SEM; an example is given in 
Section 16.2. Using linearmodels, more advanced estimation commands are straightforward to 
implement. We will show this for three-stage-least-squares (3SLS) estimation in Section 16.3. 


16.1. Setup and Notation 


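Consider the general SEM with $q$ endogenous variables $y_1, \ldots, y_q$ and $k$ exogenous variables 
$x_1, \ldots, x_k$. The system of equations is: 

$$
\begin{aligned}
y_1 &= \alpha_{12} y_2 + \alpha_{13} y_3 + \cdots + \alpha_{1q} y_q + \beta_{10} + \beta_{11} x_1 + \cdots + \beta_{1k} x_k + u_1 \\
y_2 &= \alpha_{21} y_1 + \alpha_{23} y_3 + \cdots + \alpha_{2q} y_q + \beta_{20} + \beta_{21} x_1 + \cdots + \beta_{2k} x_k + u_2 \\
&\;\;\vdots \\
y_q &= \alpha_{q1} y_1 + \alpha_{q2} y_2 + \cdots + \alpha_{q,q-1} y_{q-1} + \beta_{q0} + \beta_{q1} x_1 + \cdots + \beta_{qk} x_k + u_q
\end{aligned}
$$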


As discussed in more detail in Wooldridge (2019, Section 16), this system is not identified without 
restrictions on the parameters. The order condition for identification of any equation is that if we 
have $m$ included endogenous regressors (i.e. $\alpha$ parameters that are not restricted to 0), we need to 
exclude at least $m$ exogenous regressors (i.e. restrict their $\beta$ parameters to 0). They can then be used 
as instrumental variables. 
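
The order condition amounts to a simple count per equation. The following sketch is only an 
illustration: the function order_condition_holds and the dictionary layout are hypothetical, and the 
variable lists follow the exclusion restrictions of the labor supply system in Example 16.3. Note that 
the order condition is necessary but not sufficient for identification. 

```python
def order_condition_holds(included_endog, excluded_exog):
    """Return True if at least as many exogenous regressors are excluded
    as endogenous regressors are included (hypothetical helper)."""
    return len(excluded_exog) >= len(included_endog)


# labor supply system of Example 16.3 (expersq denotes exper squared):
equations = {
    'hours': {'included_endog': ['log(wage)'],
              'excluded_exog': ['exper', 'expersq']},
    'log(wage)': {'included_endog': ['hours'],
                  'excluded_exog': ['age', 'kidslt6', 'nwifeinc']},
}

check = {name: order_condition_holds(**spec) for name, spec in equations.items()}
print(f'check: {check}')
```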


Wooldridge, Example 16.3: Labor Supply of Married, Working Women 

We have the two endogenous variables hours and wage which influence each other. 

$$
\begin{aligned}
hours &= \alpha_{12} \log(wage) + \beta_{10} + \beta_{11} educ + \beta_{12} age + \beta_{13} kidslt6 + \beta_{14} nwifeinc \\
&\quad + \beta_{15} exper + \beta_{16} exper^2 + u_1 \\
\log(wage) &= \alpha_{21} hours + \beta_{20} + \beta_{21} educ + \beta_{22} age + \beta_{23} kidslt6 + \beta_{24} nwifeinc \\
&\quad + \beta_{25} exper + \beta_{26} exper^2 + u_2
\end{aligned}
$$

For both equations to be identified, we have to exclude at least one exogenous regressor from each 
equation. Wooldridge (2019) discusses a model in which we restrict $\beta_{15} = \beta_{16} = 0$ in the first and 
$\beta_{22} = \beta_{23} = \beta_{24} = 0$ in the second equation. 


16.2. Estimation by 2SLS 


Estimation of each equation separately by 2SLS is straightforward once we have set up the system 
and ensured identification. The excluded regressors in each equation serve as instrumental vari- 
ables. As shown in Chapter 15, the command IV2SLS from the module linearmodels provides 
convenient 2SLS estimation. 


Wooldridge, Example 16.5: Labor Supply of Married, Working Women 


Script 16.1 (Example-16-5-2SLS.py) estimates the parameters of the two equations from Example 16.3 
separately using IV2SLS. 


Script 16.1: Example-16-5-2SLS.py 
import wooldridge as woo
import numpy as np
import pandas as pd
import linearmodels.iv as iv

mroz = woo.dataWoo('mroz')

# restrict to non-missing wage observations:
mroz = mroz.dropna(subset=['lwage'])

# 2SLS regressions:
reg_iv1 = iv.IV2SLS.from_formula(
    'hours ~ 1 + educ + age + kidslt6 + nwifeinc +'
    '[np.log(wage) ~ exper + I(exper**2)]', data=mroz)
results_iv1 = reg_iv1.fit(cov_type='unadjusted', debiased=True)

reg_iv2 = iv.IV2SLS.from_formula(
    'np.log(wage) ~ 1 + educ + exper + I(exper**2) +'
    '[hours ~ age + kidslt6 + nwifeinc]', data=mroz)
results_iv2 = reg_iv2.fit(cov_type='unadjusted', debiased=True)

# print results:
table_iv1 = pd.DataFrame({'b': round(results_iv1.params, 4),
                          'se': round(results_iv1.std_errors, 4),
                          't': round(results_iv1.tstats, 4),
                          'pval': round(results_iv1.pvalues, 4)})
print(f'table_iv1: \n{table_iv1}\n')

table_iv2 = pd.DataFrame({'b': round(results_iv2.params, 4),
                          'se': round(results_iv2.std_errors, 4),
                          't': round(results_iv2.tstats, 4),
                          'pval': round(results_iv2.pvalues, 4)})
print(f'table_iv2: \n{table_iv2}\n')

# correlation of the residuals of both equations:
cor_u1u2 = np.corrcoef(results_iv1.resids, results_iv2.resids)[0, 1]
print(f'cor_u1u2: {cor_u1u2}\n')


Output of Script 16.1: Example-16-5-2SLS.py 


table_iv1: 
                      b        se       t    pval
Intercept     2225.6618  574.5641  3.8737  0.0001
educ          -183.7513   59.0998 -3.1092  0.0020
age             -7.8061    9.3780 -0.8324  0.4057
kidslt6       -198.1543  182.9291 -1.0832  0.2793
nwifeinc       -10.1696    6.6147 -1.5374  0.1249
np.log(wage)  1639.5556  470.5757  3.4841  0.0005

table_iv2: 
                    b      se       t    pval
Intercept     -0.6557  0.3378 -1.9412  0.0529
educ           0.1103  0.0155  7.1069  0.0000
exper          0.0346  0.0195  1.7742  0.0767
I(exper ** 2) -0.0007  0.0005 -1.5543  0.1209
hours          0.0001  0.0003  0.4945  0.6212

cor_u1u2: -0.903769419629963


16.3. Outlook: Estimation by 3SLS 


An interesting piece of information in Script 16.1 (Example-16-5-2SLS.py) is the correlation 
between the residuals of the equations. In the example, it is reported to be a substantially negative 
-0.90. We can account for the correlation between the error terms to derive a potentially more 
efficient parameter estimator than 2SLS. Without going into details here, the three stage least squares 
(3SLS) estimator adds another stage to 2SLS by estimating the correlation and accounting for it using 
a FGLS approach. For a detailed discussion of this and related methods, see for example Wooldridge 
(2010, Chapter 8). 

Using 3SLS in linearmodels is simple: IV3SLS is all we need, as Script 16.2 (Example-16-5-3SLS.py) 
and its output show. 


Script 16.2: Example-16-5-3SLS.py 
import wooldridge as woo
import numpy as np
import linearmodels.system as iv3

mroz = woo.dataWoo('mroz')

# restrict to non-missing wage observations:
mroz = mroz.dropna(subset=['lwage'])

# 3SLS regressions:
formula = {'eq1': 'hours ~ 1 + educ + age + kidslt6 + nwifeinc +'
                  '[np.log(wage) ~ exper+I(exper**2)]',
           'eq2': 'np.log(wage) ~ 1 + educ + exper + I(exper**2) +'
                  '[hours ~ age + kidslt6 + nwifeinc]'}

reg_3sls = iv3.IV3SLS.from_formula(formula, data=mroz)
results_3sls = reg_3sls.fit(cov_type='unadjusted', debiased=True)
print(f'results_3sls: \n{results_3sls}\n')


Output of Script 16.2: Example-16-5-3SLS.py 


results_3sls: 
                        System GLS Estimation Summary
Estimator:                        GLS   Overall R-squared:           -2.3997
No. Equations.:                     2   McElroy's R-squared:          0.7846
No. Observations:                 428   Judge's (OLS) R-squared:     -2.3997
Date:                Fri, May 08 2020   Berndt's R-squared:           0.5181
Time:                        08:55:43   Dhrymes's R-squared:         -2.3997
                                        Cov. Estimator:           unadjusted
                                        Num. Constraints:               None

                  Equation: eq1, Dependent Variable: hours
                Parameter  Std. Err.   T-stat   P-value   Lower CI   Upper CI
Intercept          2305.9     511.54   4.5077    0.0000     1300.4     3311.3
educ              -212.82     53.727  -3.9611    0.0001    -318.43    -107.21
age               -9.5150     7.9609  -1.1952    0.2327    -25.163     6.1331
kidslt6           -192.36     150.92  -1.2746    0.2032    -489.00     104.28
nwifeinc          -0.1770     3.5836  -0.0494    0.9606    -7.2210     6.8670
np.log(wage)       1781.9     439.88   4.0509    0.0001     917.30     2646.6
                                 Instruments
                            exper, I(exper ** 2)

              Equation: eq2, Dependent Variable: np.log(wage)
                Parameter  Std. Err.   T-stat   P-value   Lower CI   Upper CI
Intercept         -0.6939     0.3360  -2.0653    0.0395    -1.3543    -0.0335
educ               0.1127     0.0154   7.3355    0.0000     0.0825     0.1429
exper              0.0214     0.0154   1.3929    0.1644    -0.0088     0.0517
I(exper ** 2)     -0.0003     0.0003  -1.1303    0.2590    -0.0008     0.0002
hours              0.0002     0.0002   0.7707    0.4413    -0.0003     0.0007
                                 Instruments
                           age, kidslt6, nwifeinc

Covariance Estimator:
Homoskedastic (Unadjusted) Covariance (Debiased: True, GLS: True)


17. Limited Dependent Variable Models and 
Sample Selection Corrections 


A limited dependent variable (LDV) can only take a limited set of values. An extreme case is a 
binary variable that can only take two values. We already used such dummy variables as regressors 
in Chapter 7. Section 17.1 discusses how to use them as dependent variables. Another example of an 
LDV is a count variable that takes only non-negative integers; such variables are covered in Section 
17.2. Similarly, Tobit models discussed in Section 17.3 deal with dependent variables that can only take 
positive values (or are restricted in a similar way), but are otherwise continuous. 

Sections 17.4 and 17.5 are concerned with dependent variables that are continuous but not perfectly 
observed. For some units of the censored, truncated, or selected observations we only know that they 
are above or below a certain threshold or we don’t know anything about them. 


17.1. Binary Responses 


Binary dependent variables are frequently studied in applied econometrics. Because a dummy 
variable $y$ can only take the values 0 and 1, its (conditional) expected value is equal to the 
(conditional) probability that $y = 1$: 

$$
\begin{aligned}
E(y|x) &= 0 \cdot P(y = 0|x) + 1 \cdot P(y = 1|x) \\
&= P(y = 1|x)
\end{aligned}
\tag{17.1}
$$

So when we study the conditional mean, it makes sense to think about it as the probability of 
outcome $y = 1$. Likewise, the predicted value $\hat{y}$ should be thought of as a predicted probability. 


17.1.1. Linear Probability Models 


If a dummy variable is used as the dependent variable $y$, we can still use OLS to estimate its relation 
to the regressors $x$. These linear probability models are covered by Wooldridge (2019) in Section 7.5. 
If we write the usual linear regression model 

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u \tag{17.2}$$

and make the usual assumptions, especially MLR.4: $E(u|x) = 0$, this implies for the conditional 
mean (which is the probability that $y = 1$) and the predicted probabilities: 

$$P(y = 1|x) = E(y|x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \tag{17.3}$$

$$\hat{P}(y = 1|x) = \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k \tag{17.4}$$


The interpretation of the parameters is straightforward: $\beta_j$ is a measure of the average change in 
probability of a “success” ($y = 1$) as $x_j$ increases by one unit and the other determinants remain 
constant. Linear probability models automatically suffer from heteroscedasticity, so with OLS, we 
should use heteroscedasticity-robust inference, see Section 8.1. 


Wooldridge, Example 17.1: Married Women’s Labor Force Participation 


We study the probability that a woman is in the labor force depending on socio-demographic 
characteristics. Script 17.1 (Example-17-1-1.py) estimates a linear probability model using the data set 
mroz. The estimated coefficient of educ can be interpreted as: an additional year of schooling increases 
the probability that a woman is in the labor force ceteris paribus by 0.038 on average. We used the 
refined version of White's robust variance-covariance matrix (cov_type='HC3'). 


Script 17.1: Example-17-1-1.py 
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# estimate linear probability model:
reg_lin = smf.ols(formula='inlf ~ nwifeinc + educ + exper +'
                          'I(exper**2) + age + kidslt6 + kidsge6',
                  data=mroz)
results_lin = reg_lin.fit(cov_type='HC3')

# print regression table:
table = pd.DataFrame({'b': round(results_lin.params, 4),
                      'se': round(results_lin.bse, 4),
                      't': round(results_lin.tvalues, 4),
                      'pval': round(results_lin.pvalues, 4)})
print(f'table: \n{table}\n')


Output of Script 17.1: Example-17-1-1.py 


table: 
                    b      se       t    pval
Intercept      0.5855  0.1536  3.8125  0.0001
nwifeinc      -0.0034  0.0016 -2.1852  0.0289
educ           0.0380  0.0073  5.1766  0.0000
exper          0.0395  0.0060  6.6001  0.0000
I(exper ** 2) -0.0006  0.0002 -2.9973  0.0027
age           -0.0161  0.0024 -6.6640  0.0000
kidslt6       -0.2618  0.0322 -8.1430  0.0000
kidsge6        0.0130  0.0137  0.9526  0.3408


One problem with linear probability models is that $P(y = 1|x)$ is specified as a linear function of 
the regressors. By construction, there are (more or less realistic) combinations of regressor values 
that yield $\hat{y} < 0$ or $\hat{y} > 1$. Since these are probabilities, this does not really make sense. 

As an example, Script 17.2 (Example-17-1-2.py) calculates the predicted values for two women 
(see Section 6.2 for how to predict after OLS estimation): Woman 1 is 20 years old, has no work 
experience, 5 years of education, two children below age 6 and has additional family income of 
100,000 USD. Woman 2 is 52 years old, has 30 years of work experience, 17 years of education, no 
children and no other source of income. The predicted “probability” for woman 1 is -41%, the 
probability for woman 2 is 104% as can also be easily checked with a calculator. 


Script 17.2: Example-17-1-2.py 
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# estimate linear probability model:
reg_lin = smf.ols(formula='inlf ~ nwifeinc + educ + exper +'
                          'I(exper**2) + age + kidslt6 + kidsge6',
                  data=mroz)
results_lin = reg_lin.fit(cov_type='HC3')

# predictions for two "extreme" women:
X_new = pd.DataFrame(
    {'nwifeinc': [100, 0], 'educ': [5, 17],
     'exper': [0, 30], 'age': [20, 52],
     'kidslt6': [2, 0], 'kidsge6': [0, 0]})
predictions = results_lin.predict(X_new)

print(f'predictions: \n{predictions}\n')


Output of Script 17.2: Example-17-1-2.py 
predictions: 
0 -0.410458 
1 1.042808 
dtype: float64 
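
The calculator check mentioned above can be sketched in a few lines. The coefficients are the rounded 
values from the output of Script 17.1, so the results differ slightly from the predictions printed by 
Script 17.2. The function lpm_prediction and the key expersq (for exper squared) are names made up for 
this sketch. 

```python
# rounded coefficients from the output of Script 17.1:
coef = {'Intercept': 0.5855, 'nwifeinc': -0.0034, 'educ': 0.0380,
        'exper': 0.0395, 'expersq': -0.0006, 'age': -0.0161,
        'kidslt6': -0.2618, 'kidsge6': 0.0130}


def lpm_prediction(nwifeinc, educ, exper, age, kidslt6, kidsge6):
    """Predicted 'probability' of the linear probability model."""
    x = {'Intercept': 1, 'nwifeinc': nwifeinc, 'educ': educ,
         'exper': exper, 'expersq': exper ** 2, 'age': age,
         'kidslt6': kidslt6, 'kidsge6': kidsge6}
    return sum(coef[name] * x[name] for name in coef)


# woman 1 and woman 2 as described in the text:
p1 = lpm_prediction(nwifeinc=100, educ=5, exper=0, age=20, kidslt6=2, kidsge6=0)
p2 = lpm_prediction(nwifeinc=0, educ=17, exper=30, age=52, kidslt6=0, kidsge6=0)

print(f'p1: {round(p1, 2)}')
print(f'p2: {round(p2, 2)}')
```

Rounded to two digits, this reproduces the values of -0.41 and 1.04 discussed in the text. 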


17.1.2. Logit and Probit Models: Estimation 


Specialized models for binary responses make sure that the implied probabilities are restricted 
between 0 and 1. An important class of models specifies the success probability as 

$$P(y = 1|x) = G(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k) = G(x\beta) \tag{17.5}$$

where the “link function” $G(z)$ always returns values between 0 and 1. In the statistics literature, 
this type of model is often called generalized linear model (GLM) because a linear part $x\beta$ shows 
up within the nonlinear function $G$. 

For binary response models, by far the most widely used specifications for $G$ are 

* the probit model with $G(z) = \Phi(z)$, the standard normal CDF and 
* the logit model with $G(z) = \Lambda(z) = \frac{\exp(z)}{1 + \exp(z)}$, the CDF of the logistic distribution. 
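
Both functions can be written down with the standard library, which makes it easy to see that they map 
any real number into the unit interval. This is only a sketch: the function names probit_cdf and 
logit_cdf are made up here and are not part of statsmodels. 

```python
import math


def probit_cdf(z):
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def logit_cdf(z):
    """Logistic CDF exp(z) / (1 + exp(z)), written to avoid overflow."""
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1 + ez)


for z in [-3.0, 0.0, 3.0]:
    print(f'z = {z}: probit {probit_cdf(z):.4f}, logit {logit_cdf(z):.4f}')
```

Both functions return 0.5 at z = 0 and approach 0 and 1 for very small and very large z, respectively. 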

Wooldridge (2019, Section 17.1) provides useful discussions of the derivation and interpretation of 
these models. Here, we are concerned with the practical implementation. In statsmodels, many 
generalized linear models can be estimated with already implemented routines that work similarly to 
ols. In the following, we will use two of them frequently: 

* logit for the logit model and 

* probit for the probit model. 

Maximum likelihood estimation (MLE) of the parameters is done automatically and the 
summary of the results contains the regression table and additional information. Scripts 17.3 
(Example-17-1-3.py) and 17.4 (Example-17-1-4.py) implement the logit and probit model, 
respectively. The log likelihood value $\mathcal{L}(\hat{\beta})$ is saved as the attribute llf and is also 
reported by summary. The command also reports LL-Null, which is the log likelihood $\mathcal{L}_0$ of a 
model with an intercept only. 

Scripts 17.3 and 17.4 also demonstrate how to access the log likelihood and McFadden’s pseudo 
R-squared (attribute prsquared), which can be calculated as 

$$\text{pseudo } R^2 = 1 - \frac{\mathcal{L}(\hat{\beta})}{\mathcal{L}_0} \tag{17.6}$$
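
As a quick check of this formula, the pseudo R-squared can be computed by hand from the two log 
likelihood values. The numbers below are the rounded Log-Likelihood and LL-Null values shown in the 
logit summary in the output of Script 17.3, so the result is only accurate up to rounding. 

```python
# rounded values from the logit summary (output of Script 17.3):
llf = -401.77       # log likelihood of the estimated model
llnull = -514.87    # log likelihood of the intercept-only model

pseudo_r2 = 1 - llf / llnull
print(f'pseudo_r2: {round(pseudo_r2, 4)}')
```

This matches the reported Pseudo R-squ. of 0.2197. 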


Script 17.3: Example-17-1-3.py 
import wooldridge as woo
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# estimate logit model:
reg_logit = smf.logit(formula='inlf ~ nwifeinc + educ + exper +'
                              'I(exper**2) + age + kidslt6 + kidsge6',
                      data=mroz)

# disp = 0 avoids printing out information during the estimation:
results_logit = reg_logit.fit(disp=0)
print(f'results_logit.summary(): \n{results_logit.summary()}\n')

# log likelihood value:
print(f'results_logit.llf: {results_logit.llf}\n')

# McFadden's pseudo R2:
print(f'results_logit.prsquared: {results_logit.prsquared}\n')


Output of Script 17.3: Example-17-1-3.py 
results_logit.summary(): 
                           Logit Regression Results
Dep. Variable:                   inlf   No. Observations:                  753
Model:                          Logit   Df Residuals:                      745
Method:                           MLE   Df Model:                            7
Date:                Thu, 14 May 2020   Pseudo R-squ.:                  0.2197
Time:                        12:36:00   Log-Likelihood:                -401.77
converged:                       True   LL-Null:                       -514.87
Covariance Type:            nonrobust   LLR p-value:                 3.159e-45
                    coef    std err          z      P>|z|      [0.025      0.975]
Intercept         0.4255      0.860      0.494      0.621      -1.261       2.112
nwifeinc         -0.0213      0.008     -2.535      0.011      -0.038      -0.005
educ              0.2212      0.043      5.091      0.000       0.136       0.306
exper             0.2059      0.032      6.422      0.000       0.143       0.269
I(exper ** 2)    -0.0032      0.001     -3.104      0.002      -0.005      -0.001
age              -0.0880      0.015     -6.040      0.000      -0.117      -0.059
kidslt6          -1.4434      0.204     -7.090      0.000      -1.842      -1.044
kidsge6           0.0601      0.075      0.804      0.422      -0.086       0.207

results_logit.llf: -401.76515113438177

results_logit.prsquared: 0.21968137484058803


Script 17.4: Example-17-1-4.py 
import wooldridge as woo
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# estimate probit model:
reg_probit = smf.probit(formula='inlf ~ nwifeinc + educ + exper +'
                                'I(exper**2) + age + kidslt6 + kidsge6',
                        data=mroz)
results_probit = reg_probit.fit(disp=0)
print(f'results_probit.summary(): \n{results_probit.summary()}\n')

# log likelihood value:
print(f'results_probit.llf: {results_probit.llf}\n')

# McFadden's pseudo R2:
print(f'results_probit.prsquared: {results_probit.prsquared}\n')


Output of Script 17.4: Example-17-1-4.py 
results_probit.summary(): 
                          Probit Regression Results
Dep. Variable:                   inlf   No. Observations:                  753
Model:                         Probit   Df Residuals:                      745
Method:                           MLE   Df Model:                            7
Date:                Thu, 14 May 2020   Pseudo R-squ.:                  0.2206
Time:                        12:36:01   Log-Likelihood:                -401.30
converged:                       True   LL-Null:                       -514.87
Covariance Type:            nonrobust   LLR p-value:                 2.009e-45
                    coef    std err          z      P>|z|      [0.025      0.975]
Intercept         0.2701      0.509      0.531      0.595      -0.727       1.267
nwifeinc         -0.0120      0.005     -2.484      0.013      -0.022      -0.003
educ              0.1309      0.025      5.183      0.000       0.081       0.180
exper             0.1233      0.019      6.590      0.000       0.087       0.160
I(exper ** 2)    -0.0019      0.001     -3.145      0.002      -0.003      -0.001
age              -0.0529      0.008     -6.235      0.000      -0.069      -0.036
kidslt6          -0.8683      0.119     -7.326      0.000      -1.101      -0.636
kidsge6           0.0360      0.043      0.828      0.408      -0.049       0.121

results_probit.llf: -401.30219317389515

results_probit.prsquared: 0.22058054372529368


17.1.3. Inference 


The summary output of the logit or probit results contains a standard regression table with 
parameters and (asymptotic) standard errors. The next column is labeled z instead of t as in the 
output of ols. The interpretation is the same. The difference is that the standard errors only have 
an asymptotic foundation and the distribution used for calculating p values is the standard normal 
distribution (which is the limit of the t distribution as the degrees of freedom grow large). The bottom 
line is that tests for single parameters can be done as before, see Section 4.1. 

For testing multiple hypotheses similar to the F test (see Section 4.3), the likelihood ratio test is 
popular. It is based on comparing the log likelihood values of the unrestricted and the restricted 
model. The test statistic is 

$$LR = 2(\mathcal{L}_{ur} - \mathcal{L}_r) \tag{17.7}$$

where $\mathcal{L}_{ur}$ and $\mathcal{L}_r$ are the log likelihood values of the unrestricted and restricted model, 
respectively. Under $H_0$, the LR test statistic is asymptotically distributed as $\chi^2$ with the degrees of 
freedom equal to the number of restrictions to be tested. The test of overall significance is a special 
case just like with F tests. The null hypothesis is that all parameters except the constant are equal to 
zero. With the notation above, the test statistic is 

$$LR = 2(\mathcal{L}(\hat{\beta}) - \mathcal{L}_0). \tag{17.8}$$


Translated to statsmodels with fitted model results stored in results, this corresponds to: 


LR = 2 * (results.llf - results.llnull) 
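
For the probit model of Script 17.4, this calculation can be retraced by hand. The sketch below uses 
the log likelihood printed by that script and the rounded LL-Null value from its summary, so the 
result is only accurate up to rounding. The critical value is taken from a standard chi-squared table 
and is an input to this sketch, not something computed here. 

```python
# values from the output of Script 17.4:
llf = -401.30219317389515   # results_probit.llf
llnull = -514.87            # LL-Null, rounded as shown in the summary

# LR statistic for overall significance:
LR = 2 * (llf - llnull)

# 7 slope parameters are restricted to zero under H0;
# 5% critical value of the chi-squared distribution with 7 df (from a table):
df = 7
crit_5pct = 14.067

print(f'LR: {round(LR, 2)}')
print(f'reject H0 at 5%: {LR > crit_5pct}')
```

Up to the rounding of LL-Null, this is the value of about 227.14 that statsmodels reports as llr. 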


For other hypotheses, you can compute LR based on the log likelihood of a restricted model. 
Alternatively, statsmodels offers a Wald test with the function wald_test including the convenient 
calculation of p values. Script 17.5 (Example-17-1-5.py) implements the test of overall significance 
for the probit model using both manual and automatic calculations. It also tests the joint 
null hypothesis that experience and age are irrelevant, first with the automated Wald test and then 
with a manual LR test based on the estimated restricted model. 


Script 17.5: Example-17-1-5.py 
import wooldridge as woo
import statsmodels.formula.api as smf
import scipy.stats as stats

mroz = woo.dataWoo('mroz')

# estimate probit model:
reg_probit = smf.probit(formula='inlf ~ nwifeinc + educ + exper +'
                                'I(exper**2) + age + kidslt6 + kidsge6',
                        data=mroz)
results_probit = reg_probit.fit(disp=0)

# test of overall significance (test statistic and pvalue):
llr1_manual = 2 * (results_probit.llf - results_probit.llnull)
print(f'llr1_manual: {llr1_manual}\n')
print(f'results_probit.llr: {results_probit.llr}\n')
print(f'results_probit.llr_pvalue: {results_probit.llr_pvalue}\n')

# automatic Wald test of H0 (experience and age are irrelevant):
hypotheses = ['exper=0', 'I(exper ** 2)=0', 'age=0']
waldstat = results_probit.wald_test(hypotheses)
teststat2_autom = waldstat.statistic
pval2_autom = waldstat.pvalue
print(f'teststat2_autom: {teststat2_autom}\n')
print(f'pval2_autom: {pval2_autom}\n')

# manual likelihood ratio statistic test
# of H0 (experience and age are irrelevant):
reg_probit_restr = smf.probit(formula='inlf ~ nwifeinc + educ +'
                                      'kidslt6 + kidsge6',
                              data=mroz)
results_probit_restr = reg_probit_restr.fit(disp=0)

llr2_manual = 2 * (results_probit.llf - results_probit_restr.llf)
pval2_manual = 1 - stats.chi2.cdf(llr2_manual, 3)
print(f'llr2_manual2: {llr2_manual}\n')
print(f'pval2_manual2: {pval2_manual}\n')


Output of Script 17.5: Example-17-1-5.py
llr1_manual: 227.14202283719214

results_probit.llr: 227.14202283719214

results_probit.llr_pvalue: 2.0086732957629427e-45

teststat2_autom: [[110.91852003]]

pval2_autom: 6.96073840669924e-24

llr2_manual2: 127.03401014418023

pval2_manual2: 0.0
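
The p value of exactly 0.0 reported for the manual LR test is a floating-point artifact: the expression 1 - cdf cannot represent a probability far below machine precision. The following minimal sketch computes the upper tail directly instead. The helper chi2_sf_3df is a hypothetical name, not part of scipy; it uses the closed form of the chi-squared distribution with 3 degrees of freedom so that only the standard library is needed. With scipy, stats.chi2.sf(llr2_manual, 3) should serve the same purpose.

```python
from math import erfc, exp, pi, sqrt


def chi2_sf_3df(x):
    """Upper-tail probability P(X > x) for X ~ chi-squared with 3 d.f.

    Closed form: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)


# LR statistic from the output of Script 17.5 (rounded):
llr2 = 127.034

p_tail = chi2_sf_3df(llr2)    # tiny, but strictly positive
p_naive = 1 - (1 - p_tail)    # mimics "1 - cdf": collapses to 0.0

print(f'p_tail: {p_tail}')
print(f'p_naive: {p_naive}')
```

The conclusion of the test is unchanged: the null hypothesis is rejected at any conventional significance level.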


17.1.4. Predictions 


The command predict can calculate predicted values for the estimation sample ("fitted values") or
for arbitrary sets of regressor values, also for binary response models estimated with logit or probit.

Given that the results of the fit method are stored in the variable results, we can calculate:

* $x_i\hat\beta$ for the estimation sample with results.fittedvalues
* $\hat{y} = G(x_i\hat\beta)$ for the estimation sample with results.predict()
* $\hat{y} = G(x_i\hat\beta)$ for the regressor values stored in xpred with results.predict(xpred)

The predictions for the two hypothetical women introduced in Section 17.1.1 are repeated for the
linear probability, logit, and probit models in Script 17.6 (Example-17-1-6.py). Unlike the linear
probability model, the predicted probabilities from the logit and probit models remain between 0
and 1.

Script 17.6: Example-17-1-6.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# estimate models:
reg_lin = smf.ols(formula='inlf ~ nwifeinc + educ + exper +'
                          'I(exper**2) + age + kidslt6 + kidsge6',
                  data=mroz)
results_lin = reg_lin.fit(cov_type='HC3')

reg_logit = smf.logit(formula='inlf ~ nwifeinc + educ + exper +'
                              'I(exper**2) + age + kidslt6 + kidsge6',
                      data=mroz)
results_logit = reg_logit.fit(disp=0)

reg_probit = smf.probit(formula='inlf ~ nwifeinc + educ + exper +'
                                'I(exper**2) + age + kidslt6 + kidsge6',
                        data=mroz)
results_probit = reg_probit.fit(disp=0)

# predictions for two "extreme" women:
X_new = pd.DataFrame(
    {'nwifeinc': [100, 0], 'educ': [5, 17],
     'exper': [0, 30], 'age': [20, 52],
     'kidslt6': [2, 0], 'kidsge6': [0, 0]})
predictions_lin = results_lin.predict(X_new)
predictions_logit = results_logit.predict(X_new)
predictions_probit = results_probit.predict(X_new)

print(f'predictions_lin: \n{predictions_lin}\n')
print(f'predictions_logit: \n{predictions_logit}\n')
print(f'predictions_probit: \n{predictions_probit}\n')


Output of Script 17.6: Example-17-1-6.py
predictions_lin:
0   -0.410458
1    1.042808
dtype: float64

predictions_logit:
0    0.005218
1    0.950049
dtype: float64

predictions_probit:
0    0.001065
1    0.959870
dtype: float64

Figure 17.1. Predictions from Binary Response Models (Simulated Data)

If we only have one regressor, predicted values can nicely be plotted against it. Figure 17.1 shows
such a figure for a simulated data set. For interested readers, the script used for generating the data 
and the figure is printed as Script 17.7 (Binary-Predictions.py) in Appendix IV (p. 399). In 
this example, the linear probability model clearly predicts probabilities outside of the “legal” area 
between 0 and 1. The logit and probit models yield almost identical predictions. This is a general 
finding that holds for most data sets. 
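
Why the logit and probit predictions stay inside the unit interval can be seen directly from the link functions. The following minimal sketch evaluates both for a few hypothetical index values (they are not estimates from any script in this chapter) and uses only the standard library; the function names are chosen for illustration.

```python
from math import exp
from statistics import NormalDist


def logit_link(z):
    """Logistic CDF: G(z) = exp(z) / (1 + exp(z))."""
    return 1 / (1 + exp(-z))


def probit_link(z):
    """Standard normal CDF: G(z) = Phi(z)."""
    return NormalDist().cdf(z)


# hypothetical values of the linear index (illustration only):
index_values = [-4.0, -1.0, 0.0, 1.0, 4.0]

for z in index_values:
    print(f'z={z:5.1f}  logit={logit_link(z):.4f}  probit={probit_link(z):.4f}')
```

The linear probability model, in contrast, reports the linear index itself as the prediction, so nothing prevents values below 0 or above 1.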


17.1.5. Partial Effects 


The parameters of linear regression models have straightforward interpretations: $\beta_j$ measures the
ceteris paribus effect of $x_j$ on $E(y|x)$. The parameters of nonlinear models like logit and probit have a
less straightforward interpretation since the linear index $x\beta$ affects $\hat{y}$ through the link function $G$.

A useful measure of the influence is the partial effect (or marginal effect), which in a graph like
Figure 17.1 is the slope and has the same interpretation as the parameters in the linear model.
Because of the chain rule, it is

$$\frac{\partial \hat{y}}{\partial x_j} = \frac{\partial G(\hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k)}{\partial x_j} = \hat\beta_j \cdot g(\hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k), \qquad (17.8)$$

where $g(z)$ is the derivative of the link function $G(z)$. So

* for the probit model, the partial effect is

$$\frac{\partial \hat{y}}{\partial x_j} = \hat\beta_j \cdot \phi(x\hat\beta) \qquad (17.9)$$

* for the logit model, it is

$$\frac{\partial \hat{y}}{\partial x_j} = \hat\beta_j \cdot \lambda(x\hat\beta) \qquad (17.10)$$

where $\phi(z)$ and $\lambda(z)$ are the PDFs of the standard normal and the logistic distribution, respectively.
The partial effect depends on the value of $x\hat\beta$. The PDFs have the famous bell shape with highest
values in the middle and values close to zero in the tails. This is already obvious from Figure 17.1.
Depending on the value of x, the slope of the probability differs. For our simulated data set, Figure
17.2 shows the estimated partial effects for all 100 observed x values. Interested readers can see the
complete code for this as Script 17.8 (Binary-Margeff.py) in Appendix IV (p. 399).

Figure 17.2. Partial Effects for Binary Response Models (Simulated Data)
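
How strongly the partial effect varies with the linear index can be made concrete with a few lines of code. This is a sketch under assumptions: the coefficient 0.5 and the index values are hypothetical, and the helper functions are written for illustration rather than taken from statsmodels.

```python
from math import exp
from statistics import NormalDist


def probit_partial_effect(beta_j, index):
    """beta_j * phi(index), with phi the standard normal PDF."""
    return beta_j * NormalDist().pdf(index)


def logit_partial_effect(beta_j, index):
    """beta_j * lambda(index), with lambda the logistic PDF."""
    cdf = 1 / (1 + exp(-index))
    return beta_j * cdf * (1 - cdf)


beta_j = 0.5  # hypothetical coefficient (illustration only)
for index in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    pe_probit = probit_partial_effect(beta_j, index)
    pe_logit = logit_partial_effect(beta_j, index)
    print(f'index={index:5.1f}  probit={pe_probit:.4f}  logit={pe_logit:.4f}')
```

Both effects are largest at an index of zero and shrink toward zero in the tails, which mirrors the bell shape of the two PDFs.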
The fact that the partial effects differ by regressor values makes it harder to present the results in
a concise and meaningful way. There are two common ways to aggregate the partial effects:

* Partial effects at the average: $PEA_j = \hat\beta_j \cdot g(\bar{x}\hat\beta)$
* Average partial effects: $APE_j = \frac{1}{n}\sum_{i=1}^{n} \hat\beta_j \cdot g(x_i\hat\beta) = \hat\beta_j \cdot \overline{g(x\hat\beta)}$

where $\bar{x}$ is the vector of sample averages of the regressors and $\overline{g(x\hat\beta)}$ is the sample average of $g$
evaluated at the individual linear index $x_i\hat\beta$. Both measures multiply each coefficient $\hat\beta_j$ with a
constant factor.

The first part of Script 17.9 (Example-17-1-7.py) implements the APE calculations for our labor
force participation example using already known functions:

1. The linear indices $x_i\hat\beta$ are accessed using fittedvalues.
2. The factors $\overline{g(x\hat\beta)}$ are calculated by using the PDF functions logistic.pdf and norm.pdf
from the module scipy and then averaging over the sample with mean.
3. The APEs are calculated by multiplying the coefficients obtained with params with the corresponding
factor. Note that for the linear probability model, the partial effects are constant and
simply equal to the coefficients.

The second part of Script 17.9 (Example-17-1-7.py) shows how this can be done conveniently
by using the method get_margeff(). All values (except the constant) are replicated. APEs for the
constant are not part of the method's output since they do not have a direct meaningful interpretation.
The APEs for the other variables don't differ too much between the models. As a general observation,
as long as we are interested in APEs only and not in individual predictions or partial effects and as
long as not too many probabilities are close to 0 or 1, the linear probability model often works well
enough.
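
The difference between the two aggregation methods is easy to see in a small numerical sketch. The probit coefficients and the five regressor values below are hypothetical and only serve as an illustration; the calculation itself follows the PEA and APE definitions given above.

```python
from statistics import NormalDist, mean

phi = NormalDist().pdf  # g(z) for the probit model

# hypothetical probit estimates and regressor values (illustration only):
b0, b1 = -1.0, 0.8
x = [0.0, 0.5, 1.0, 2.0, 4.0]

# linear index for each observation
index = [b0 + b1 * xi for xi in x]

# PEA: evaluate g once, at the index of the average regressor value
pea = b1 * phi(b0 + b1 * mean(x))

# APE: evaluate g for every observation, then average
ape = b1 * mean([phi(z) for z in index])

print(f'PEA: {pea:.4f}')
print(f'APE: {ape:.4f}')
```

Because g is nonlinear, averaging before or after applying g gives different factors, so PEA and APE generally differ.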


Script 17.9: Example-17-1-7.py
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import scipy.stats as stats

mroz = woo.dataWoo('mroz')

# estimate models:
reg_lin = smf.ols(formula='inlf ~ nwifeinc + educ + exper + I(exper**2) +'
                          'age + kidslt6 + kidsge6', data=mroz)
results_lin = reg_lin.fit(cov_type='HC3')

reg_logit = smf.logit(formula='inlf ~ nwifeinc + educ + exper + I(exper**2) +'
                              'age + kidslt6 + kidsge6', data=mroz)
results_logit = reg_logit.fit(disp=0)

reg_probit = smf.probit(formula='inlf ~ nwifeinc + educ + exper + I(exper**2) +'
                                'age + kidslt6 + kidsge6', data=mroz)
results_probit = reg_probit.fit(disp=0)

# manual average partial effects
# (for the linear model they are simply the coefficients):
APE_lin = np.array(results_lin.params)

xb_logit = results_logit.fittedvalues
factor_logit = np.mean(stats.logistic.pdf(xb_logit))
APE_logit_manual = results_logit.params * factor_logit

xb_probit = results_probit.fittedvalues
factor_probit = np.mean(stats.norm.pdf(xb_probit))
APE_probit_manual = results_probit.params * factor_probit

table_manual = pd.DataFrame({'APE_lin': np.round(APE_lin, 4),
                             'APE_logit_manual': np.round(APE_logit_manual, 4),
                             'APE_probit_manual': np.round(APE_probit_manual, 4)})
print(f'table_manual: \n{table_manual}\n')

# automatic average partial effects:
coef_names = np.array(results_lin.model.exog_names)
coef_names = np.delete(coef_names, 0)  # drop Intercept

APE_logit_autom = results_logit.get_margeff().margeff
APE_probit_autom = results_probit.get_margeff().margeff

table_auto = pd.DataFrame({'coef_names': coef_names,
                           'APE_logit_autom': np.round(APE_logit_autom, 4),
                           'APE_probit_autom': np.round(APE_probit_autom, 4)})
print(f'table_auto: \n{table_auto}\n')

Output of Script 17.9: Example-17-1-7.py
table_manual:
               APE_lin  APE_logit_manual  APE_probit_manual
Intercept       0.5855            0.0760             0.0812
nwifeinc       -0.0034           -0.0038            -0.0036
educ            0.0380            0.0395             0.0394
exper           0.0395            0.0368             0.0371
I(exper ** 2)  -0.0006           -0.0006            -0.0006
age            -0.0161           -0.0157            -0.0159
kidslt6        -0.2618           -0.2578            -0.2612
kidsge6         0.0130            0.0107             0.0108

table_auto:
      coef_names  APE_logit_autom  APE_probit_autom
0       nwifeinc          -0.0038           -0.0036
1           educ           0.0395            0.0394
2          exper           0.0368            0.0371
3  I(exper ** 2)          -0.0006           -0.0006
4            age          -0.0157           -0.0159
5        kidslt6          -0.2578           -0.2612
6        kidsge6           0.0107            0.0108


17.2. Count Data: The Poisson Regression Model 


Instead of just 0/1-coded binary data, count data can take any non-negative integer 0, 1, 2, ... . If
they take very large numbers (like the number of students in a school), they can be approximated
reasonably well as continuous variables in linear models and estimated using OLS. If the numbers
are relatively small (like the number of children of a mother), this approximation might not work
well. For example, predicted values can become negative.

The Poisson regression model is the most basic and convenient model explicitly designed for count
data. The probability that y takes any value $h \in \{0, 1, 2, \ldots\}$ for this model can be written as

$$P(y = h|x) = \frac{e^{-e^{x\beta}} \cdot e^{h \cdot x\beta}}{h!} \qquad (17.11)$$

The parameters of the Poisson model are much easier to interpret than those of a probit or logit
model. In this model, the conditional mean of y is

$$E(y|x) = e^{x\beta}, \qquad (17.12)$$

so each slope parameter $\beta_j$ has the interpretation of a semi-elasticity:

$$\frac{\partial E(y|x)}{\partial x_j} = \beta_j \cdot e^{x\beta} = \beta_j \cdot E(y|x) \qquad (17.13)$$

$$\Leftrightarrow \quad \beta_j = \frac{1}{E(y|x)} \cdot \frac{\partial E(y|x)}{\partial x_j} \qquad (17.14)$$

If $x_j$ increases by one unit (and the other regressors remain the same), $E(y|x)$ will increase roughly
by $100 \cdot \beta_j$ percent (the exact value is once again $100 \cdot (e^{\beta_j} - 1)$).
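
The gap between the rough and the exact percentage effect grows with the size of the coefficient. As a small worked example, the sketch below applies both formulas to two Poisson coefficients taken from the output of Script 17.10; the variable names are chosen for illustration.

```python
from math import exp

# Poisson coefficients taken from the output of Script 17.10:
coefs = {'pcnv': -0.4016, 'black': 0.6608}

approx_effects = {name: 100 * b for name, b in coefs.items()}
exact_effects = {name: 100 * (exp(b) - 1) for name, b in coefs.items()}

for name in coefs:
    print(f'{name}: approx {approx_effects[name]:.1f}%, '
          f'exact {exact_effects[name]:.1f}%')
```

For coefficients of this magnitude the approximation is visibly off, so the exact formula is the safer choice when reporting results.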

A problem with the Poisson model is that it is quite restrictive. The Poisson distribution implicitly
restricts the variance of y to be equal to its mean. If this assumption is violated but the conditional
mean is still correctly specified, the Poisson parameter estimates are consistent, but the standard
errors and all inferences based on them are invalid. A simple solution is to interpret the Poisson
estimators as quasi-maximum likelihood estimators (QMLE). Similar to the heteroscedasticity-robust
inference for OLS discussed in Section 8.1, the standard errors can be adjusted.

Estimating Poisson regression models in statsmodels is straightforward. They can be estimated
using the convenient formula syntax and the command poisson. For the more robust QMLE
standard errors, we use the command glm with family=sm.families.Poisson().


Wooldridge, Example 17.3: Poisson Regression for Number of Arrests 


We apply the Poisson regression model to study the number of arrests of young men in 1986. Script 17.10
(Example-17-3.py) imports the data and first estimates a linear regression model using OLS. Then,
a Poisson model is estimated using poisson. Finally, we estimate the same model using the QMLE
specification with glm to adjust the standard errors for a potential violation of the Poisson distribution. By
construction, the parameter estimates are the same, but the standard errors are larger for the QMLE.
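
The adjustment can be checked from the printed results alone. As documented for the scale argument of GLM.fit, scale='X2' estimates the dispersion from Pearson's chi-squared statistic divided by the residual degrees of freedom, so every standard error should be inflated by the same factor. The sketch below copies a few standard errors from the output of Script 17.10 and compares them; the dictionary names are chosen for illustration.

```python
# standard errors copied from the output of Script 17.10:
se_poisson = {'Intercept': 0.0673, 'pcnv': 0.0850, 'black': 0.0738,
              'hispan': 0.0739, 'born60': 0.0641}
se_qpoisson = {'Intercept': 0.0828, 'pcnv': 0.1046, 'black': 0.0909,
               'hispan': 0.0910, 'born60': 0.0789}

# ratio of quasi-Poisson to Poisson standard errors, coefficient by coefficient
ratios = {name: se_qpoisson[name] / se_poisson[name] for name in se_poisson}
for name, r in ratios.items():
    print(f'{name}: {r:.3f}')

# implied dispersion estimate (squared ratio, averaged over coefficients):
sigma2_implied = sum(r ** 2 for r in ratios.values()) / len(ratios)
print(f'implied dispersion: {sigma2_implied:.2f}')
```

Up to rounding of the printed values, the ratio is about 1.23 for every coefficient, which corresponds to an estimated dispersion of roughly 1.5 instead of the value 1 implied by the Poisson distribution.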


Script 17.10: Example-17-3.py
import wooldridge as woo
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

crime1 = woo.dataWoo('crime1')

# estimate linear model:
reg_lin = smf.ols(formula='narr86 ~ pcnv + avgsen + tottime + ptime86 +'
                          'qemp86 + inc86 + black + hispan + born60',
                  data=crime1)
results_lin = reg_lin.fit()

# print regression table:
table_lin = pd.DataFrame({'b': round(results_lin.params, 4),
                          'se': round(results_lin.bse, 4),
                          't': round(results_lin.tvalues, 4),
                          'pval': round(results_lin.pvalues, 4)})
print(f'table_lin: \n{table_lin}\n')

# estimate Poisson model:
reg_poisson = smf.poisson(formula='narr86 ~ pcnv + avgsen + tottime +'
                                  'ptime86 + qemp86 + inc86 + black +'
                                  'hispan + born60',
                          data=crime1)
results_poisson = reg_poisson.fit(disp=0)

# print regression table:
table_poisson = pd.DataFrame({'b': round(results_poisson.params, 4),
                              'se': round(results_poisson.bse, 4),
                              't': round(results_poisson.tvalues, 4),
                              'pval': round(results_poisson.pvalues, 4)})
print(f'table_poisson: \n{table_poisson}\n')

# estimate Quasi-Poisson model:
reg_qpoisson = smf.glm(formula='narr86 ~ pcnv + avgsen + tottime + ptime86 +'
                               'qemp86 + inc86 + black + hispan + born60',
                       family=sm.families.Poisson(),
                       data=crime1)
# the argument scale controls for the dispersion in exponential dispersion models,
# see the module documentation for more details:
results_qpoisson = reg_qpoisson.fit(scale='X2', disp=0)

# print regression table:
table_qpoisson = pd.DataFrame({'b': round(results_qpoisson.params, 4),
                               'se': round(results_qpoisson.bse, 4),
                               't': round(results_qpoisson.tvalues, 4),
                               'pval': round(results_qpoisson.pvalues, 4)})
print(f'table_qpoisson: \n{table_qpoisson}\n')


Output of Script 17.10: Example-17-3.py
table_lin:
                b      se        t    pval
Intercept  0.5766  0.0379  15.2150  0.0000
pcnv      -0.1319  0.0404  -3.2642  0.0011
avgsen    -0.0113  0.0122  -0.9257  0.3547
tottime    0.0121  0.0094   1.2790  0.2010
ptime86   -0.0409  0.0088  -4.6378  0.0000
qemp86    -0.0513  0.0145  -3.5420  0.0004
inc86     -0.0015  0.0003  -4.2613  0.0000
black      0.3270  0.0454   7.1987  0.0000
hispan     0.1938  0.0397   4.8799  0.0000
born60    -0.0225  0.0333  -0.6747  0.4999

table_poisson:
                b      se       t    pval
Intercept -0.5996  0.0673 -8.9158  0.0000
pcnv      -0.4016  0.0850 -4.7260  0.0000
avgsen    -0.0238  0.0199 -1.1918  0.2333
tottime    0.0245  0.0148  1.6603  0.0969
ptime86   -0.0986  0.0207 -4.7625  0.0000
qemp86    -0.0380  0.0290 -1.3099  0.1902
inc86     -0.0081  0.0010 -7.7624  0.0000
black      0.6608  0.0738  8.9503  0.0000
hispan     0.4998  0.0739  6.7609  0.0000
born60    -0.0510  0.0641 -0.7967  0.4256

table_qpoisson:
                b      se       t    pval
Intercept -0.5996  0.0828 -7.2393  0.0000
pcnv      -0.4016  0.1046 -3.8373  0.0001
avgsen    -0.0238  0.0246 -0.9677  0.3332
tottime    0.0245  0.0182  1.3481  0.1776
ptime86   -0.0986  0.0255 -3.8670  0.0001
qemp86    -0.0380  0.0357 -1.0636  0.2875
inc86     -0.0081  0.0013 -6.3028  0.0000
black      0.6608  0.0909  7.2673  0.0000
hispan     0.4998  0.0910  5.4896  0.0000
born60    -0.0510  0.0789 -0.6469  0.5177

Figure 17.3. Conditional Means for the Tobit Model


17.3. Corner Solution Responses: The Tobit Model 


Corner solutions describe situations where the variable of interest is continuous but restricted in
range. Typically, it cannot be negative. A significant share of people buy exactly zero amounts of
alcohol, tobacco, or diapers. The Tobit model explicitly models dependent variables like this. It can
be formulated in terms of a latent variable $y^*$ that can take all real values. For it, the classical linear
regression model assumptions MLR.1-MLR.6 are assumed to hold. If $y^*$ is positive, we observe
$y = y^*$. Otherwise, $y = 0$. Wooldridge (2019, Section 17.2) shows how to derive properties and the
likelihood function for this model.

The problem of interpreting the parameters is similar to logit or probit models. While $\beta_j$ measures
the ceteris paribus effect of $x_j$ on $E(y^*|x)$, the interest is typically in y instead. The partial effect of
interest can be written as

$$\frac{\partial E(y|x)}{\partial x_j} = \beta_j \cdot \Phi\left(\frac{x\beta}{\sigma}\right) \qquad (17.15)$$

and again depends on the regressor values x. To aggregate them over the sample, we can either
calculate the partial effects at the average (PEA) or the average partial effect (APE) just like with the
binary variable models.
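
The partial effect in (17.15) can be checked numerically against the conditional mean of the Tobit model, $E(y|x) = \Phi(x\beta/\sigma) \cdot x\beta + \sigma \cdot \phi(x\beta/\sigma)$. The following sketch uses hypothetical parameter values and helper names chosen for illustration; it relies only on the standard library.

```python
from statistics import NormalDist

N = NormalDist()


def tobit_cond_mean(index, sigma):
    """E(y|x) = Phi(xb/sigma) * xb + sigma * phi(xb/sigma)."""
    z = index / sigma
    return N.cdf(z) * index + sigma * N.pdf(z)


def tobit_partial_effect(beta_j, index, sigma):
    """dE(y|x)/dx_j = beta_j * Phi(xb/sigma), as in equation (17.15)."""
    return beta_j * N.cdf(index / sigma)


# hypothetical parameter values (illustration only):
beta_j, sigma = 2.0, 1.5
for index in [-3.0, 0.0, 3.0]:
    print(f'xb={index:5.1f}  E(y|x)={tobit_cond_mean(index, sigma):.4f}  '
          f'PE={tobit_partial_effect(beta_j, index, sigma):.4f}')
```

For very negative index values the partial effect is close to zero, and for large positive values it approaches the coefficient itself.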

Figure 17.3 depicts these properties for a simulated data set with only one regressor. Whenever
$y^* > 0$, $y = y^*$ and the symbols x and + are on top of each other. If $y^* \le 0$, then $y = 0$. Therefore,
the slope of $E(y|x)$ gets close to zero for very low x values. The code that generated the data set and
the graph can be found as Script 17.11 (Tobit-CondMean.py) in Appendix IV (p. 402).

We use statsmodels for the practical ML estimation, but not in the usual way. The reason
is that there is no ready-made routine to perform the estimation, so we have to come up with our own
definition of a log likelihood. Once we have done this, we let statsmodels do the rest. Before you
have a look at Script 17.12 (Example-17-2.py) you might want to repeat Section 1.8.4. The basic
idea is to inherit from the class GenericLikelihoodModel in statsmodels, i.e. we reuse its
attributes and methods and call this new class Tobit. Now, we define the method nloglikeobs,
which simply gives the code to obtain the negative log likelihood per observation for a given set
of parameters (i.e. data and coefficients you want to estimate). Wooldridge (2019) provides details
on the definition of the log likelihood we have implemented here. To keep things simple, we make
no use of formula syntax and provide the data as matrices with the help of patsy. Because we
inherited from GenericLikelihoodModel, the new class Tobit also has the method fit, which
internally calls nloglikeobs multiple times with different values for params to find an optimum
of the provided log likelihood. We provide OLS results as a start solution for this optimization
procedure. We finally use the (inherited) method summary to print out nicely formatted outputs
with the estimated coefficients.


Wooldridge, Example 17.2: Married Women’s Annual Labor Supply 


We have already estimated labor supply models for the women in the data set mroz, ignoring the fact
that the hours worked is necessarily non-negative. Script 17.12 (Example-17-2.py) estimates a Tobit
model accounting for this fact.


Script 17.12: Example-17-2.py
import wooldridge as woo
import numpy as np
import patsy as pt
import scipy.stats as stats
import statsmodels.formula.api as smf
import statsmodels.base.model as smclass

mroz = woo.dataWoo('mroz')
y, X = pt.dmatrices('hours ~ nwifeinc + educ + exper +'
                    'I(exper**2)+ age + kidslt6 + kidsge6',
                    data=mroz, return_type='dataframe')

# generate starting solution:
reg_ols = smf.ols(formula='hours ~ nwifeinc + educ + exper + I(exper**2) +'
                          'age + kidslt6 + kidsge6', data=mroz)
results_ols = reg_ols.fit()
sigma_start = np.log(sum(results_ols.resid ** 2) / len(results_ols.resid))
params_start = np.concatenate((np.array(results_ols.params), sigma_start),
                              axis=None)


# extend statsmodels class by defining nloglikeobs:
class Tobit(smclass.GenericLikelihoodModel):
    # define a function that returns the negative log likelihood per observation
    # for a set of parameters that is provided by the argument "params":
    def nloglikeobs(self, params):
        # objects in "self" are defined in the parent class:
        X = self.exog
        y = self.endog
        p = X.shape[1]
        # for details on the implementation see Wooldridge (2019), formula 17.22:
        beta = params[0:p]
        sigma = np.exp(params[p])
        y_hat = np.dot(X, beta)
        y_eq = (y == 0)
        y_g = (y > 0)
        ll = np.empty(len(y))
        ll[y_eq] = np.log(stats.norm.cdf(-y_hat[y_eq] / sigma))
        ll[y_g] = np.log(stats.norm.pdf((y - y_hat)[y_g] / sigma)) - np.log(sigma)
        # return an array of negative log likelihoods for each observation:
        return -ll


# results of MLE:
reg_tobit = Tobit(endog=y, exog=X)
results_tobit = reg_tobit.fit(start_params=params_start, maxiter=10000, disp=0)
print(f'results_tobit.summary(): \n{results_tobit.summary()}\n')

Output of Script 17.12: Example-17-2.py


results_tobit.summary():

                              Tobit Results

Dep. Variable:                  hours   Log-Likelihood:   -3819.1
Model:                          Tobit   AIC:                7654.
Method:            Maximum Likelihood   BIC:                7691.
Date:                Thu, 14 May 2020
Time:                        12:36:10
No. Observations:                 753
Df Residuals:                     745
Df Model:                           7

Intercept       965.3055   446.435    2.162   0.031     90.309  1840.302
nwifeinc         -8.8142     4.459   -1.977   0.048    -17.554    -0.075
educ             80.6456    21.583    3.736   0.000     38.343   122.948
exper           131.5643    17.279    7.614   0.000     97.697   165.431
I(exper ** 2)    -1.8642     0.538   -3.467   0.001     -2.918    -0.810
age             -54.4050     7.418   -7.334   0.000    -68.945   -39.865
kidslt6        -894.0217   111.878   -7.991   0.000  -1113.298  -674.745
kidsge6         -16.2180    38.640   -0.420   0.675    -91.952    59.516
par0              7.0229     0.037  189.514   0.000      6.950     7.096
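
Note that the last row, par0, is not the standard deviation itself: nloglikeobs defines sigma as np.exp(params[p]), so the optimizer works with the logarithm of $\sigma$. A minimal sketch of the back-transformation, using the rounded value printed above (the variable names are chosen for illustration):

```python
from math import exp

par0 = 7.0229          # last row of the output of Script 17.12 (rounded)
sigma_hat = exp(par0)  # undo the log transformation used in nloglikeobs

print(f'sigma_hat: {sigma_hat:.1f}')
```

The estimated standard deviation of the latent error is therefore roughly 1122 hours. The same back-transformation applies to par0 in any model that parameterizes $\sigma$ in this way.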


17.4. Censored and Truncated Regression Models 


Censored regression models are closely related to Tobit models. In fact, their parameters can be
estimated with nearly the same procedure discussed in the previous section. General censored
regression models also start from a latent variable $y^*$. The observed dependent variable y is equal
to $y^*$ for some (the uncensored) observations. For the other observations, we only know an upper
or lower bound for $y^*$. In the basic Tobit model, we observe $y = y^*$ in the "uncensored" cases with
$y^* > 0$ and we only know that $y^* \le 0$ if we observe $y = 0$. The censoring rules can be much
more general. There could be censoring from above or the thresholds can vary from observation to
observation.
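
How censoring enters the likelihood can be sketched for a single observation. The example assumes normally distributed errors and right censoring, where the recorded value is only a lower bound for $y^*$; the function name and the numerical values are hypothetical. Because the bound is passed per observation, thresholds that vary across observations are covered as well.

```python
from math import log
from statistics import NormalDist

N = NormalDist()


def loglik_right_censored(y, index, sigma, censored):
    """Log likelihood contribution of one observation.

    y        : observed value (for censored cases: the lower bound for y*)
    index    : linear index x*beta
    sigma    : standard deviation of the latent error
    censored : True if we only know that y* >= y
    """
    z = (y - index) / sigma
    if censored:
        # P(y* >= y | x) = 1 - Phi(z) = Phi(-z)
        return log(N.cdf(-z))
    # density of y* evaluated at the observed value
    return log(N.pdf(z)) - log(sigma)


# hypothetical values (illustration only):
ll_cens = loglik_right_censored(y=2.0, index=2.0, sigma=1.0, censored=True)
ll_uncens = loglik_right_censored(y=2.0, index=2.0, sigma=1.0, censored=False)
print(f'censored: {ll_cens:.4f}, uncensored: {ll_uncens:.4f}')
```

Censored observations contribute a probability, uncensored observations contribute a density; the full log likelihood is the sum over all observations.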

The main difference between Tobit and censored regression models is the interpretation. In the
former case, we are interested in the observed y; in the latter case, we are interested in the underlying
$y^*$.¹ Censoring is merely a data problem that has to be accounted for instead of a logical feature of
the dependent variable. We already know how to estimate Tobit models. With censored regression,
we can use the same tools. The problem of calculating partial effects does not exist in this case since
we are interested in the linear $E(y^*|x)$ and the slope parameters are directly equal to the partial
effects of interest.

¹ Wooldridge (2019, Section 17.4) uses the notation w instead of y and y instead of $y^*$.

Wooldridge, Example 17.4: Duration of Recidivism


We are interested in the criminal prognosis of individuals released from prison. We model the time it
takes them to be arrested again. Explanatory variables include demographic characteristics as well as
a dummy variable workprg indicating the participation in a work program during their time in prison.
The 1445 former inmates observed in the data set recid were followed for a while.

During that time, 893 inmates were not arrested again. For them, we only know that their true duration
$y^*$ is at least durat, which for them is the time between the release and the end of the observation
period, so we have right censoring. The threshold of censoring differs by individual depending on when
they were released.

In Script 17.13 (Example-17-4.py) we inherit from GenericLikelihoodModel to create a class
CensReg. Because of the more complicated selection rule, we have to extend the __init__ method
by a parameter cens, which is a dummy variable indicating censored observations. Details on the
foundation of the implementation for the log likelihood with right censored data in nloglikeobs are
provided in Wooldridge (2019).

Estimates can directly be interpreted. Because of the logarithmic specification, they represent
semi-elasticities. For example, married individuals take around $100 \cdot \hat\beta = 34\%$ longer to be arrested again.
(Actually, the accurate number is $100 \cdot (e^{\hat\beta} - 1) = 40\%$.) There is no significant effect of the work program.

Script 17.13: Example-17-4.py
import wooldridge as woo
import numpy as np
import patsy as pt
import scipy.stats as stats
import statsmodels.formula.api as smf
import statsmodels.base.model as smclass

recid = woo.dataWoo('recid')

# define dummy for censored observations:
censored = recid['cens'] != 0
y, X = pt.dmatrices('ldurat ~ workprg + priors + tserved + felon +'
                    'alcohol + drugs + black + married + educ + age',
                    data=recid, return_type='dataframe')

# generate starting solution:
reg_ols = smf.ols(formula='ldurat ~ workprg + priors + tserved + felon +'
                          'alcohol + drugs + black + married + educ + age',
                  data=recid)
results_ols = reg_ols.fit()
sigma_start = np.log(sum(results_ols.resid ** 2) / len(results_ols.resid))
params_start = np.concatenate((np.array(results_ols.params), sigma_start),
                              axis=None)


# extend statsmodels class by defining nloglikeobs:
class CensReg(smclass.GenericLikelihoodModel):
    def __init__(self, endog, cens, exog):
        # store the censoring dummy so that nloglikeobs can access it:
        self.cens = cens
        super(smclass.GenericLikelihoodModel, self).__init__(endog, exog,
                                                             missing='none')

    def nloglikeobs(self, params):
        X = self.exog
        y = self.endog
        cens = self.cens
        p = X.shape[1]
        beta = params[0:p]
        sigma = np.exp(params[p])
        y_hat = np.dot(X, beta)
        ll = np.empty(len(y))
        # uncensored:
        ll[~cens] = np.log(stats.norm.pdf((y - y_hat)[~cens] /
                                          sigma)) - np.log(sigma)
        # censored:
        ll[cens] = np.log(stats.norm.cdf(-(y - y_hat)[cens] / sigma))
        return -ll


# results of MLE:
reg_censReg = CensReg(endog=y, exog=X, cens=censored)
results_censReg = reg_censReg.fit(start_params=params_start,
                                  maxiter=10000, method='BFGS', disp=0)
print(f'results_censReg.summary(): \n{results_censReg.summary()}\n')

Output of Script 17.13: Example-17-4.py


results_censReg.summary():

                             CensReg Results

Dep. Variable:                 ldurat   Log-Likelihood:
Model:                        CensReg   AIC:                3216.
Method:            Maximum Likelihood   BIC:                3274.
Date:                Thu, 14 May 2020
Time:                        12:36:12
No. Observations:                1445
Df Residuals:                    1434
Df Model:                          10

Intercept    4.0994   0.348   11.796   0.000    3.418    4.781
workprg     -0.0626   0.120   -0.521   0.602   -0.298    0.173
priors      -0.1373   0.021   -6.396   0.000   -0.179   -0.095
tserved     -0.0193   0.003   -6.491   0.000   -0.025   -0.013
felon        0.4440   0.145    3.060   0.002    0.160    0.728
alcohol     -0.6349   0.144   -4.403   0.000   -0.918   -0.352
drugs       -0.2982   0.133   -2.246   0.025   -0.558   -0.038
black       -0.5427   0.117   -4.621   0.000   -0.773   -0.313
married      0.3407   0.140    2.436   0.015    0.067    0.615
educ         0.0229   0.025    0.902   0.367   -0.027    0.073
age          0.0039   0.001    6.450   0.000    0.003    0.005
par0         0.5936   0.034   17.249   0.000    0.526    0.661


Truncation is a more serious problem than censoring since our observations are more severely
affected. If the true latent variable $y^*$ is above or below a certain threshold, the individual is not even
sampled. We therefore do not have any information on these individuals. Classical truncated regression
models rely on parametric and distributional assumptions to correct this problem. In statsmodels they can be
implemented by providing an adjusted log likelihood just as discussed above. We will not go into
details here, but Wooldridge (2019) describes how to implement the log likelihood.

Figure 17.4 shows results for a simulated data set. Because it is simulated, we actually know the
values for everybody (hollow and solid dots). In our sample, we only observe those with y > 0
(solid dots). When applying OLS to this sample, we get a downward biased slope (dashed line).
Truncated regression fixes this problem and gives a consistent slope estimator (solid line). Script
17.14 (TruncReg-Simulation.py), which generated the data set and the graph, is shown in
Appendix IV (p. 404).

Figure 17.4. Truncated Regression: Simulated Example

17.5. Sample Selection Corrections


Sample selection models are related to truncated regression models. We do have a random sample 
from the population of interest, but we do not observe the dependent variable y for a non-random 
sub-sample. The sample selection is not based on a threshold for y but on some other selection 
mechanism. 

Heckman's selection model consists of a probit-like model for the binary fact whether y is observed 
and a linear regression-like model for y. Selection can be driven by the same determinants as y but 
should have at least one additional factor excluded from the equation for y. Wooldridge (2019, 
Section 17.5) discusses the specification and estimation of these models in more detail. 

The classical Heckman selection model can be estimated either in two steps using software for
probit and OLS as discussed by Wooldridge (2019) or by a specialized command using MLE. We will
demonstrate the two-step approach with statsmodels.
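
The link between the two steps is the inverse Mills ratio, $\lambda(c) = \phi(c)/\Phi(c)$, which is evaluated at the fitted probit index and then added as a regressor in the second-step regression. A minimal sketch with hypothetical index values follows; the function name is chosen for illustration.

```python
from statistics import NormalDist

N = NormalDist()


def inverse_mills(c):
    """lambda(c) = phi(c) / Phi(c), evaluated at the probit index c."""
    return N.pdf(c) / N.cdf(c)


# hypothetical probit index values (illustration only):
for c in [-2.0, 0.0, 2.0]:
    print(f'c={c:5.1f}  lambda={inverse_mills(c):.4f}')
```

The ratio is large for observations with a low selection probability and approaches zero when selection is almost certain.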


Wooldridge, Example 17.5: Wage Offer Equation for Married Women


We once again look at the sample of women in the data set mroz. Of the 753 women, 428 worked
(inlf=1) and the rest did not work (inlf=0). For the latter, we do not observe the wage they would
have gotten had they worked. Script 17.15 (Example-17-5.py) estimates the Heckman selection
model using two formulas: one for the selection and one for the wage equation.


Script 17.15: Example-17-5.py
import wooldridge as woo
import statsmodels.formula.api as smf
import scipy.stats as stats

mroz = woo.dataWoo('mroz')

# step 1 (use all n observations to estimate a probit model of s_i on z_i):
reg_probit = smf.probit(formula='inlf ~ educ + exper + I(exper**2) +'
                                'nwifeinc + age + kidslt6 + kidsge6',
                        data=mroz)
results_probit = reg_probit.fit(disp=0)
pred_inlf = results_probit.fittedvalues
mroz['inv_mills'] = stats.norm.pdf(pred_inlf) / stats.norm.cdf(pred_inlf)

# step 2 (regress y_i on x_i and inv_mills in sample selection):
reg_heckit = smf.ols(formula='lwage ~ educ + exper + I(exper**2) + inv_mills',
                     subset=(mroz['inlf'] == 1), data=mroz)
results_heckit = reg_heckit.fit()

# print results:
print(f'results_heckit.summary(): \n{results_heckit.summary()}\n')

Output of Script 17.15: Example-17-5.py
results_heckit.summary():

                            OLS Regression Results

Dep. Variable:                  lwage   R-squared:                       0.157
Model:                            OLS   Adj. R-squared:                  0.149
Method:                 Least Squares   F-statistic:                     19.69
Date:                Thu, 14 May 2020   Prob (F-statistic):           7.14e-15
Time:                        12:36:14   Log-Likelihood:                -431.57
No. Observations:                 428   AIC:                             873.1
Df Residuals:                     423   BIC:                             893.4
Df Model:                           4
Covariance Type:            nonrobust

                    coef    std err          t      P>|t|      [0.025      0.975]
Intercept        -0.5781      0.307     -1.885      0.060      -1.181       0.025
educ              0.1091      0.016      6.987      0.000       0.078       0.140
exper             0.0439      0.016      2.684      0.008       0.012       0.076
I(exper ** 2)    -0.0009      0.000     -1.946      0.052      -0.002    8.49e-06
inv_mills         0.0323      0.134      0.240      0.810      -0.232       0.296

Omnibus:                       78.250   Durbin-Watson:                   1.958
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              299.801
Skew:                          -0.761   Prob(JB):                     7.93e-66
Kurtosis:                       6.807   Cond. No.                     3.61e+03

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.61e+03. This might indicate that there are
strong multicollinearity or other numerical problems.


18. Advanced Time Series Topics 


After we have introduced time series concepts in Chapters 10-12, this chapter touches on some more
advanced topics in time series econometrics. Namely, we look at infinite distributed lag models
in Section 18.1, unit root tests in Section 18.2, spurious regression in Section 18.3, cointegration in
Section 18.4, and forecasting in Section 18.5.


18.1. Infinite Distributed Lag Models 


We have covered finite distributed lag models in Section 10.3. We have estimated those and related 
models in Python using the module statsmodels. In infinite distributed lag models, shocks in the 
regressors z; have an infinitely long impact on yt, y;,1,... . The long-run propensity is the overall 
future effect of increasing z; by one unit and keeping it at that level. 

Without further restrictions, infinite distributed lag models cannot be estimated. Wooldridge (2019, 
Section 18.1) discusses two different models. The geometric (or Koyck) distributed lag model boils 
down to a linear regression equation in terms of lagged dependent variables 


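$$y_t = \alpha_0 + \gamma z_t + \rho y_{t-1} + v_t \qquad (18.1)$$
and has a long-run propensity of
$$\mathrm{LRP} = \frac{\gamma}{1 - \rho}. \qquad (18.2)$$
The rational distributed lag model can be written as a somewhat more general equation
$$y_t = \alpha_0 + \gamma_0 z_t + \rho y_{t-1} + \gamma_1 z_{t-1} + v_t \qquad (18.3)$$
and has a long-run propensity of
$$\mathrm{LRP} = \frac{\gamma_0 + \gamma_1}{1 - \rho}. \qquad (18.4)$$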


In terms of the implementation of these models, there is nothing really new compared to Section 
10.3. The only difference is that we include lagged dependent variables as regressors. 
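
To see where the long-run propensity of the geometric model comes from, it can help to look at the
steady state of the regression equation. The following is a short sketch of the standard argument,
assuming $|\rho| < 1$ and setting the error term to its mean of zero. The steady-state symbols $y^*$
and $z^*$ are notation introduced only for this illustration.

```latex
\begin{align*}
y^* &= \alpha_0 + \gamma z^* + \rho y^* \\
(1 - \rho)\, y^* &= \alpha_0 + \gamma z^* \\
y^* &= \frac{\alpha_0}{1 - \rho} + \frac{\gamma}{1 - \rho}\, z^*
\quad\Rightarrow\quad
\mathrm{LRP} = \frac{\partial y^*}{\partial z^*} = \frac{\gamma}{1 - \rho}
\end{align*}
```

In the rational model, both $z_t$ and $z_{t-1}$ equal $z^*$ in the steady state, so the same steps give
the numerator $\gamma_0 + \gamma_1$ instead of $\gamma$.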


Wooldridge, Example 18.1: Housing Investment and Residential Price Inflation


Script 18.1 (Example-18-1.py) implements the geometric and the rational distributed lag models for
the housing investment equation. The dependent variable is detrended by the method detrend, which
simply uses the residual of a regression on a linear time trend. We store this detrended variable in the
data frame.

The two models are estimated using statsmodels and a regression table very similar to Wooldridge
(2019, Table 18.1) is produced. Finally, we estimate the LRP for both models using the formulas
given above. We first extract the (named) coefficient and then do the calculations. For example,
results_koyck.params['gprice'] is the coefficient with the label 'gprice' which in our notation
above corresponds to $\gamma$ in the geometric distributed lag model.
Script 18.1: Example-18-1.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm

hseinv = woo.dataWoo('hseinv')

# add lags and detrend:
hseinv['linvpc_det'] = sm.tsa.tsatools.detrend(hseinv['linvpc'])
hseinv['gprice_lag1'] = hseinv['gprice'].shift(1)
hseinv['linvpc_det_lag1'] = hseinv['linvpc_det'].shift(1)

# Koyck geometric d.l.:
reg_koyck = smf.ols(formula='linvpc_det ~ gprice + linvpc_det_lag1',
                    data=hseinv)
results_koyck = reg_koyck.fit()

# print regression table:
table_koyck = pd.DataFrame({'b': round(results_koyck.params, 4),
                            'se': round(results_koyck.bse, 4),
                            't': round(results_koyck.tvalues, 4),
                            'pval': round(results_koyck.pvalues, 4)})
print(f'table_koyck: \n{table_koyck}\n')

# rational d.l.:
reg_rational = smf.ols(formula='linvpc_det ~ gprice + linvpc_det_lag1 +'
                               'gprice_lag1',
                       data=hseinv)
results_rational = reg_rational.fit()

# print regression table:
table_rational = pd.DataFrame({'b': round(results_rational.params, 4),
                               'se': round(results_rational.bse, 4),
                               't': round(results_rational.tvalues, 4),
                               'pval': round(results_rational.pvalues, 4)})
print(f'table_rational: \n{table_rational}\n')

# LRP:
lrp_koyck = results_koyck.params['gprice'] / (
        1 - results_koyck.params['linvpc_det_lag1'])
print(f'lrp_koyck: {lrp_koyck}\n')

lrp_rational = (results_rational.params['gprice'] +
                results_rational.params['gprice_lag1']) / (
        1 - results_rational.params['linvpc_det_lag1'])
print(f'lrp_rational: {lrp_rational}\n')
Output of Script 18.1: Example-18-1.py
table_koyck:
                      b      se       t    pval
Intercept       -0.0100  0.0179 -0.5561  0.5814
gprice           3.0948  0.9333  3.3159  0.0020
linvpc_det_lag1  0.3399  0.1316  2.5831  0.0138

table_rational:
                      b      se       t    pval
Intercept        0.0059  0.0169  0.3466  0.7309
gprice           3.2564  0.9703  3.3559  0.0019
linvpc_det_lag1  0.5472  0.1517  3.6076  0.0009
gprice_lag1     -2.9363  0.9732 -3.0172  0.0047

lrp_koyck: 4.688434194769012

lrp_rational: 0.7066808046888197


18.2. Testing for Unit Roots 


We have covered strongly dependent unit root processes in Chapter 11 and promised to supply tests 
for unit roots later. There are several tests available. Conceptually, the Dickey-Fuller (DF) test is the 
simplest. If we want to test whether variable $y$ has a unit root, we regress $\Delta y_t$ on $y_{t-1}$. The test
statistic is the usual t-test statistic of the slope coefficient. One problem is that because of the unit 
root, this test statistic is not t or normally distributed, not even asymptotically. Instead, we have to 
use special distribution tables for the critical values. The distribution also depends on whether we 
allow for a time trend in this regression. 

The augmented Dickey-Fuller (ADF) test is a generalization that allows for richer dynamics in the 
process of $y$. To implement it, we add lagged values $\Delta y_{t-1}, \Delta y_{t-2}, \ldots$ to the differenced regression
equation. 

Of course, working with the special (A)DF tables of critical values is somewhat inconvenient.
The module statsmodels offers automated DF and ADF tests for models with time trends. The
command adfuller(y, maxlag=k) performs an ADF test and automatically selects the
number of lags in $\Delta y$ (with k as the maximum number of lags). For example, adfuller(y, maxlag=0)
requests zero lags, i.e. a simple DF test. If you set the argument autolag=None, the value
provided in maxlag determines the exact number of considered lags. The argument regression
allows you to specify your model. Using regression='ct', for example, means that you include
a constant and a trend.


Wooldridge, Example 18.4: Unit Root in Real GDP 


Script 18.2 (Example-18-4.py) implements an ADF test for the logarithm of U.S. real GDP including a
linear time trend. For a test with one lag in $\Delta y$ and time trend, the equation to estimate is


$$\Delta y_t = \alpha + \theta y_{t-1} + \gamma_1 \Delta y_{t-1} + \delta t + e_t.$$


We already know how to implement such a regression using ols, so we demonstrate the use of
adfuller. The relevant test statistic is t = -2.421 and the critical values are given in Wooldridge (2019,
Table 18.3). More conveniently, the script also reports a p value of 0.37. So the null hypothesis of a unit
root cannot be rejected at any reasonable significance level.
Script 18.2: Example-18-4.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.api as sm

inven = woo.dataWoo('inven')
inven['lgdp'] = np.log(inven['gdp'])

# automated ADF:
res_ADF_aut = sm.tsa.stattools.adfuller(inven['lgdp'], maxlag=1, autolag=None,
                                        regression='ct', regresults=True)
ADF_stat_aut = res_ADF_aut[0]
ADF_pval_aut = res_ADF_aut[1]
table = pd.DataFrame({'names': res_ADF_aut[3].resols.model.exog_names,
                      'b': np.round(res_ADF_aut[3].resols.params, 4),
                      'se': np.round(res_ADF_aut[3].resols.bse, 4),
                      't': np.round(res_ADF_aut[3].resols.tvalues, 4),
                      'pval': np.round(res_ADF_aut[3].resols.pvalues, 4)})
print(f'table: \n{table}\n')
print(f'ADF_stat_aut: {ADF_stat_aut}\n')
print(f'ADF_pval_aut: {ADF_pval_aut}\n')


Output of Script 18.2: Example-18-4.py 


table: 

   names       b      se       t    pval
0     x1 -0.2096  0.0866 -2.4207  0.0215
1     x2  0.2638  0.1647  1.6010  0.1195
2  const  1.6627  0.6717  2.4752  0.0190
3     x3  0.0059  0.0027  2.1772  0.0372


ADF_stat_aut: -2.420732881476166 


ADF_pval_aut: 0.3686558457135789 


18.3. Spurious Regression 


Unit roots generally destroy the usual (large sample) properties of estimators and tests. A leading 
example is spurious regression. Suppose two variables x and y are completely unrelated but both 
follow a random walk: 


$$x_t = x_{t-1} + a_t$$
$$y_t = y_{t-1} + e_t,$$


where $a_t$ and $e_t$ are i.i.d. random innovations. If we want to test whether they are related from a
random sample, we could simply regress $y$ on $x$. A t test should reject the (true) null hypothesis that
the slope coefficient is equal to zero with a probability of $\alpha$, for example 5%. The phenomenon of
spurious regression implies that this happens much more often.

Script 18.3 (Simulate-Spurious-Regression-1.py) simulates this model for one sample. Remember
from Section 11.2 how to simulate a random walk in a simple way: with a starting value of
zero, it is just the cumulative sum of the innovations. The time series for this simulated sample of
size n = 50 is shown in Figure 18.1. When we regress $y$ on $x$, the t statistic for the slope parameter is
larger than 4 with a p value much smaller than 1%. So we would reject the (correct) null hypothesis
that the variables are unrelated.


Figure 18.1. Spurious Regression: Simulated Data from Script 18.3 


Script 18.3: Simulate-Spurious-Regression-1.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import scipy.stats as stats

# set the random seed:
np.random.seed(123456)

# i.i.d. N(0,1) innovations:
n = 51
e = stats.norm.rvs(0, 1, size=n)
e[0] = 0
a = stats.norm.rvs(0, 1, size=n)
a[0] = 0

# independent random walks:
x = np.cumsum(a)
y = np.cumsum(e)
sim_data = pd.DataFrame({'y': y, 'x': x})

# regression:
reg = smf.ols(formula='y ~ x', data=sim_data)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')

# graph:
plt.plot(x, color='black', marker='', linestyle='-', label='x')
plt.plot(y, color='black', marker='', linestyle='--', label='y')
plt.ylabel('x,y')
plt.legend()
plt.savefig('PyGraphs/Simulate-Spurious-Regression-1.pdf')


Output of Script 18.3: Simulate-Spurious-Regression-1.py 


table: 

b se t pval 
Intercept -6.5100 0.3465 -18.7894 0.0 
x 1.2695 0.0929 13.6607 0.0 


We know that by definition, a valid test should reject a true null hypothesis with a probability
of $\alpha$, so maybe we were just unlucky with the specific sample we took. We therefore repeat
the same analysis with 10,000 samples from the same data generating process in Script 18.4
(Simulate-Spurious-Regression-2.py). For each of the samples, we store the p value of the
slope parameter in an array named pvals. After these simulations are run, we simply check how
often we would have rejected $H_0: \beta_1 = 0$ by comparing these p values with 0.05.

We find that in 6,652 of the samples, so in 67% instead of $\alpha = 5\%$, we rejected $H_0$. So the t test
seriously screws up the statistical inference because of the unit roots.


Script 18.4: Simulate-Spurious-Regression-2.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import scipy.stats as stats

# set the random seed:
np.random.seed(123456)

pvals = np.empty(10000)

# repeat r times:
for i in range(10000):
    # i.i.d. N(0,1) innovations:
    n = 51
    e = stats.norm.rvs(0, 1, size=n)
    e[0] = 0
    a = stats.norm.rvs(0, 1, size=n)
    a[0] = 0

    # independent random walks:
    x = np.cumsum(a)
    y = np.cumsum(e)
    sim_data = pd.DataFrame({'y': y, 'x': x})

    # regression:
    reg = smf.ols(formula='y ~ x', data=sim_data)
    results = reg.fit()
    pvals[i] = results.pvalues['x']

# how often is p<=5%:
count_pval_smaller = np.count_nonzero(pvals <= 0.05)  # counts True elements
print(f'count_pval_smaller: {count_pval_smaller}\n')

# how often is p>5%:
count_pval_greater = np.count_nonzero(pvals > 0.05)
print(f'count_pval_greater: {count_pval_greater}\n')


Output of Script 18.4: Simulate-Spurious-Regression-2.py
count_pval_smaller: 6652

count_pval_greater: 3348


18.4. Cointegration and Error Correction Models 


In Section 18.3, we just saw that it is not a good idea to do linear regression with integrated variables. 
This is not generally true. If two variables are not only integrated (i.e. they have a unit root), but 
cointegrated, linear regression with them can actually make sense. Often, economic theory suggests a 
stable long-run relationship between integrated variables which implies cointegration. Cointegration 
implies that in the regression equation 


$$y_t = \beta_0 + \beta_1 x_t + u_t,$$


the error term $u$ does not have a unit root, while both $y$ and $x$ do. A test for cointegration can
be based on this finding: We first estimate this model by OLS and then test for a unit root in the
residuals $\hat{u}$. Again, we have to adjust the distribution of the test statistic and critical values. This
approach is called the Engle-Granger test in Wooldridge (2019, Section 18.4) or the Phillips-Ouliaris (PO)
test. See the documentation of coint in statsmodels for details on the implementation.

If we find cointegration, we can estimate error correction models. In the Engle-Granger procedure, 
these models can be estimated in a two-step procedure using OLS. 


18.5. Forecasting 


One major goal of time series analysis is forecasting. Given the information we have today, we want 
to give our best guess about the future and also quantify our uncertainty. Given a time series model 
for $y$, the best guess for $y_{t+1}$ given information $I_t$ is the conditional mean $E(y_{t+1}|I_t)$. For a model
like


$$y_t = \delta_0 + \alpha_1 y_{t-1} + \gamma_1 z_{t-1} + u_t, \qquad (18.5)$$


suppose we are at time $t$, know both $y_t$ and $z_t$, and want to predict $y_{t+1}$. Also suppose that
$E(u_t|I_{t-1}) = 0$. Then,


$$E(y_{t+1}|I_t) = \delta_0 + \alpha_1 y_t + \gamma_1 z_t \qquad (18.6)$$


and our prediction from an estimated model would be $\hat{y}_{t+1} = \hat{\delta}_0 + \hat{\alpha}_1 y_t + \hat{\gamma}_1 z_t$.

We already know how to get in-sample and (hypothetical) out-of-sample predictions including 
forecast intervals from linear models using the command get_prediction. It can also be used for 
our purposes. 
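
As a minimal sketch of how get_prediction can produce a one-step-ahead forecast, consider a
simulated AR(1) process. The data, the seed, the coefficient values, and the variable names
below are assumptions made for this illustration only. The lagged variable y_1 plays the role of
$y_{t-1}$ in the regression, and the last observed value is used as the regressor for the forecast.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulate an AR(1) process (illustrative parameters):
rng = np.random.default_rng(0)
n = 200
y = np.zeros(n)
for t in range(1, n):
    y[t] = 1.0 + 0.5 * y[t - 1] + rng.normal()

# estimation sample: y_t and its first lag:
df = pd.DataFrame({'y': y[1:], 'y_1': y[:-1]})
results = smf.ols(formula='y ~ y_1', data=df).fit()

# one-step-ahead forecast: the last observed value is the regressor:
new_data = pd.DataFrame({'y_1': [y[-1]]})
fc = results.get_prediction(new_data).summary_frame(alpha=0.05)

# 'mean' is the point forecast, 'obs_ci_*' is the 95% forecast interval:
print(fc[['mean', 'obs_ci_lower', 'obs_ci_upper']])
```

The columns obs_ci_lower and obs_ci_upper account for the error term of the new observation
and are therefore wider than the confidence interval for the conditional mean.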
There are several ways to evaluate the performance of forecast models. It makes a
lot of sense not to look at the model fit within the estimation sample but at the out-of-sample
forecast performance. Suppose we have used observations $y_1, \ldots, y_n$ for estimation and additionally
have observations $y_{n+1}, \ldots, y_{n+m}$. For this set of observations, we obtain out-of-sample forecasts
$f_{n+1}, \ldots, f_{n+m}$ and calculate the $m$ forecast errors


$$e_t = y_t - f_t \quad \text{for } t = n+1, \ldots, n+m. \qquad (18.7)$$


We want these forecast errors to be as small (in absolute value) as possible. Useful measures are 
the root mean squared error (RMSE) and the mean absolute error (MAE): 
$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{h=1}^{m} e_{n+h}^2} \qquad (18.8)$$

$$\mathrm{MAE} = \frac{1}{m} \sum_{h=1}^{m} |e_{n+h}| \qquad (18.9)$$
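
The two measures translate directly into numpy. The following sketch uses made-up numbers for
the actual values and the forecasts (they are not taken from any data set in this book), so only
the calculation itself should be taken from it.

```python
import numpy as np

# hypothetical actual values and out-of-sample forecasts (m = 5):
y_actual = np.array([4.9, 4.5, 4.2, 4.0, 4.7])
forecast = np.array([5.5, 5.2, 4.9, 4.6, 4.5])

# forecast errors:
e = y_actual - forecast

# root mean squared error and mean absolute error:
rmse = np.sqrt(np.mean(e ** 2))
mae = np.mean(np.abs(e))

print(f'rmse: {rmse}')
print(f'mae: {mae}')
```

Because squaring gives more weight to large errors, the RMSE is never smaller than the MAE
for the same set of forecast errors.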


Wooldridge, Example 18.8: Forecasting the U.S. Unemployment Rate 


Script 18.5 (Example-18-8.py) estimates two simple models for forecasting the unemployment rate.
The first one is a basic AR(1) model with only lagged unemployment as a regressor, the second one
adds lagged inflation. We generate the Boolean variable yt96 to restrict the estimation sample to
years until 1996. After the estimation, we make predictions including 95% forecast intervals. Wooldridge
(2019) explains how this can be done manually. We are somewhat lazy and simply use the command
get_prediction.

Script 18.5 (Example-18-8.py) also calculates the forecast errors of the unemployment rate for the two
models used in Example 18.8. Predictions are made for the other seven available years until 2003. The
actual unemployment rate and the forecasts are plotted; the result is shown in Figure 18.2. Finally,
we calculate the RMSE and MAE for both models. Both measures suggest that the second model
including the lagged inflation performs better.
Script 18.5: Example-18-8.py
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

phillips = woo.dataWoo('phillips')

# define yearly time series beginning in 1948:
date_range = pd.date_range(start='1948', periods=len(phillips), freq='Y')
phillips.index = date_range.year

# estimate models:
yt96 = (phillips['year'] <= 1996)
reg_1 = smf.ols(formula='unem ~ unem_1', data=phillips, subset=yt96)
results_1 = reg_1.fit()
reg_2 = smf.ols(formula='unem ~ unem_1 + inf_1', data=phillips, subset=yt96)
results_2 = reg_2.fit()

# predictions for 1997-2003 including 95% forecast intervals:
yf97 = (phillips['year'] > 1996)
pred_1 = results_1.get_prediction(phillips[yf97])
pred_1_FI = pred_1.summary_frame(
    alpha=0.05)[['mean', 'obs_ci_lower', 'obs_ci_upper']]
pred_1_FI.index = date_range.year[yf97]
print(f'pred_1_FI: \n{pred_1_FI}\n')

pred_2 = results_2.get_prediction(phillips[yf97])
pred_2_FI = pred_2.summary_frame(
    alpha=0.05)[['mean', 'obs_ci_lower', 'obs_ci_upper']]
pred_2_FI.index = date_range.year[yf97]
print(f'pred_2_FI: \n{pred_2_FI}\n')

# forecast errors:
e1 = phillips[yf97]['unem'] - pred_1_FI['mean']
e2 = phillips[yf97]['unem'] - pred_2_FI['mean']

# RMSE and MAE:
rmse1 = np.sqrt(np.mean(e1 ** 2))
print(f'rmse1: {rmse1}\n')
rmse2 = np.sqrt(np.mean(e2 ** 2))
print(f'rmse2: {rmse2}\n')
mae1 = np.mean(abs(e1))
print(f'mae1: {mae1}\n')
mae2 = np.mean(abs(e2))
print(f'mae2: {mae2}\n')

# graph:
plt.plot(phillips[yf97]['unem'], color='black', marker='', label='unem')
plt.plot(pred_1_FI['mean'], color='black',
         marker='', linestyle='--', label='forecast without inflation')
plt.plot(pred_2_FI['mean'], color='black',
         marker='', linestyle='-.', label='forecast with inflation')
plt.ylabel('unemployment')
plt.xlabel('time')
plt.legend()
plt.savefig('PyGraphs/Example-18-8.pdf')
Figure 18.2. Out-of-sample Forecasts for Unemployment

Output of Script 18.5: Example-18-8.py
pred_1_FI:
          mean  obs_ci_lower  obs_ci_upper
1997  5.526452      3.392840      7.660064
1998  5.160275      3.021340      7.299210
1999  4.867333      2.720958      7.013709
2000  4.647627      2.493832      6.801422
2001  4.501157      2.341549      6.660764
2002  5.087040      2.946509      7.227571
2003  5.819394      3.686837      7.951950

pred_2_FI:
          mean  obs_ci_lower  obs_ci_upper
1997  5.348468      3.548908      7.148027
1998  4.896451      3.090266      6.702636
1999  4.509137      2.693393      6.324881
2000  4.425175      2.607626      6.242724
2001  4.516062      2.696384      6.335740
2002  4.923537      3.118433      6.728641
2003  5.350271      3.540939      7.159603

rmse1: 0.5761199200210152

rmse2: 0.5217543207440963

mae1: 0.5420140442759066

mae2: 0.48419452667721685


19. Carrying Out an Empirical Project 


We are now ready for serious empirical work. Chapter 19 of Wooldridge (2019) discusses the for- 
mulation of interesting theories, collection of raw data, and the writing of research papers. We are 
concerned with the data analysis part of a research project and will cover some aspects of using 
Python for real research. 

This chapter is mainly about a few tips and tricks that might help to make our life easier by 
organizing the analyses and the output of Python in a systematic way. While we have worked with 
Python scripts throughout this book, Section 19.1 gives additional hints for using them effectively in 
larger projects. Section 19.2 shows how the results of our analyses can be written to a text file instead 
of just being displayed on the screen. 

Section 19.3 discusses how Jupyter Notebooks can be used to generate nicely formatted documents
that present Python code and output in a more structured way, potentially even ready for
publication. To this end, we introduce Markdown, a straightforward markup language, and LaTeX, a
widely used typesetting system which was, for example, used to generate this book. Jupyter Notebooks
efficiently use Python, Markdown and LaTeX together to generate anything between clearly laid out results
documentations and complete little research papers that automatically include the analysis results.


19.1. Working with Python Scripts 


We already argued in Section 1.1.2 that anything we do in Python or any other statistical package 
should be done in scripts or the equivalent. In this way, it is always transparent how we generated 
our results. A typical empirical project has roughly the following steps: 

1. Data Preparation: import raw data, recode and generate new variables, create sub-samples, ... 

2. Generation of descriptive statistics, distribution of the main variables, ... 

3. Estimation of the econometric models 

4. Presentation of the results: tables, figures, ... 

If we combine all these steps in one Python script, it is very easy for us to understand how we came 
up with the regression results even a year after we have done the analysis. At least as important: It is 
also easy for our thesis supervisor, collaborators or journal referees to understand where the results 
came from and to reproduce them. If we made a mistake at some point or get an updated raw data 
set, it is easy to repeat the whole analysis to generate new results. 

It is crucial to add helpful comments to the Python scripts explaining what is done in each step. 
Scripts should start with an explanation like the following: 
Script 19.1: ultimate-calcs.py
########################################################################
# Project X:
# "The Ultimate Question of Life, the Universe, and Everything"
# Project Collaborators: Mr. X, Mrs. Y
#
# Python Script "ultimate-calcs"
# by: F Heiss
# Date of this version: February 18, 2019
########################################################################

# external modules:
import numpy as np
import datetime as dt

# create a time stamp:
ts = dt.datetime.now()

# print to logfile.txt ('w' resets the logfile before writing output)
# in the provided path (make sure that the folder structure
# you may provide already exists):
print(f'This is a log file from: \n{ts}\n',
      file=open('Pyout/19/logfile.txt', 'w'))

# the first calculation using the function "square root" from numpy:
result1 = np.sqrt(1764)

# print to logfile.txt but with keeping the previous results ('a'):
print(f'result1: {result1}\n',
      file=open('Pyout/19/logfile.txt', 'a'))

# the second calculation reverses the first one:
result2 = result1 ** 2

# print to logfile.txt but with keeping the previous results ('a'):
print(f'result2: {result2}',
      file=open('Pyout/19/logfile.txt', 'a'))


In the next section, we will explain the details of Script 19.1 (ultimate-calcs.py). If a project
requires many and/or time-consuming calculations, it might be useful to separate them into several 
Python scripts. For example, we could have four different scripts corresponding to the steps listed 
above: 

* data.py 

* descriptives.py 

* estimation.py 

* results.py 

So once the potentially time-consuming data cleaning is done, we don't have to repeat it every
time we run regressions. Instead, we save the cleaned data as an intermediary step and load it in
subsequent analyses. To avoid confusion, it is highly advisable to document interdependencies. Both
descriptives.py and estimation.py should at the beginning have a comment like:


# Depends on data.py


And results.py could have a comment like:


# Depends on estimation.py
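
The idea of saving the cleaned data once and loading it in later scripts could look roughly like
the following sketch. To keep it self-contained, both steps are shown in one block and a temporary
folder stands in for the project folder. The file name data_clean.csv and the small data set are
purely hypothetical.

```python
import os
import tempfile
import pandas as pd

# --- data.py (sketch): clean the raw data and save the result ---
project_dir = tempfile.mkdtemp()  # stand-in for the project folder
clean_path = os.path.join(project_dir, 'data_clean.csv')

# hypothetical raw data with one missing value:
raw = pd.DataFrame({'wage': [10.0, None, 12.5, 8.0],
                    'educ': [12, 16, 14, 10]})
clean = raw.dropna().reset_index(drop=True)
clean.to_csv(clean_path, index=False)

# --- estimation.py (sketch) ---
# Depends on data.py
data = pd.read_csv(clean_path)
print(data.describe())
```

In a real project, the two parts would live in separate files and the path would point to a
folder inside the project, so that estimation.py can be rerun without repeating the cleaning step.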
19.2. Logging Output in Text Files


Having the results appear on the screen and being able to copy and paste from there might work for
small projects. For larger projects, this is impractical. A straightforward way of writing all results
to a file is to use the command print and route the output not to the console but to a log file. If we
want to write the output of a print command to a file logfile.txt, the basic syntax is:


print(result, file=open('logfile.txt', 'w'))


Script 19.1 (ultimate-calcs.py) gives a demonstration and also explains that the second
argument of open controls whether the log file is reset ('w') or the results are appended to an
existing one ('a'). See the documentation for other available options. We also include a time stamp to
document when we performed our analyses, as the following log file resulting from Script 19.1
(ultimate-calcs.py) shows:


File logfile.txt 
This is a log file from: 

2020-05-14 12:57:38.996493 

result1: 42.0


result2: 1764.0 
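
A variant of this approach, sketched here as an alternative and not as the method used in
Script 19.1, opens the log file in a with statement so that it is closed explicitly once the
output is written. The temporary folder and the variable names are assumptions made to keep the
example self-contained.

```python
import datetime as dt
import os
import tempfile
import numpy as np

# temporary folder as a stand-in for the log folder of a project:
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, 'logfile.txt')

ts = dt.datetime.now()
result1 = np.sqrt(1764)

# 'w' resets the log file; it is closed at the end of the with block:
with open(log_path, 'w') as logfile:
    print(f'This is a log file from: \n{ts}\n', file=logfile)
    print(f'result1: {result1}\n', file=logfile)

# 'a' appends to the existing log file:
with open(log_path, 'a') as logfile:
    print(f'result2: {result1 ** 2}', file=logfile)

# read the log file back to check its content:
with open(log_path) as logfile:
    log_text = logfile.read()
print(log_text)
```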


There are other ways to document the results of your work. For example, you could globally
define that all output of print commands should be directed to the log file with sys.stdout =
open('logfile2.txt', 'w'). Script 19.2 (ultimate-calcs2.py) demonstrates this alternative
and produces the same log file. Finally, we want to mention the module logging, which provides a
set of convenient functions to document events like errors or warnings during the execution of your
program. For the scope of this book, however, the usual print statement should be sufficient.

Script 19.2: ultimate-calcs2.py
# external modules:
import numpy as np
import datetime as dt
import sys

# make sure that the folder structure you may provide already exists:
sys.stdout = open('Pyout/19/logfile2.txt', 'w')

# create a time stamp:
ts = dt.datetime.now()

# print to logfile2.txt:
print(f'This is a log file from: \n{ts}\n')

# the first calculation using the function "square root" from numpy:
result1 = np.sqrt(1764)

# print to logfile2.txt:
print(f'result1: {result1}\n')

# the second calculation reverses the first one:
result2 = result1 ** 2

# print to logfile2.txt:
print(f'result2: {result2}')
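
For readers curious about the module logging mentioned above, the following is a minimal sketch of
how events could be written to a log file with it. The logger name, the file name, the message
format, and the artificial warning are all assumptions made for this illustration. A temporary folder
is used so that the example runs without an existing folder structure.

```python
import logging
import os
import tempfile
import numpy as np

# temporary folder as a stand-in for the log folder of a project:
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, 'logfile3.txt')

# set up a logger that writes time-stamped messages to the file:
logger = logging.getLogger('ultimate_calcs')  # hypothetical logger name
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_path, mode='w')
handler.setFormatter(
    logging.Formatter('%(asctime)s %(levelname)s: %(message)s'))
logger.addHandler(handler)

# document a result:
result1 = np.sqrt(1764)
logger.info('result1: %s', result1)

# document a problem instead of silently skipping it:
value = -1.0
if value < 0:
    logger.warning('cannot take the square root of %s, skipping', value)

# close the file and detach the handler:
handler.close()
logger.removeHandler(handler)

with open(log_path) as f:
    log_text = f.read()
print(log_text)
```

Each line in the resulting file starts with a time stamp and the level of the event, which makes
it easier to find warnings in long log files.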
Figure 19.1. Creating a Jupyter Notebook


19.3. Formatted Documents with Jupyter Notebook


Jupyter Notebook is an open source and web based environment that is maintained by the Project
Jupyter.¹ A Jupyter Notebook is used to produce documents containing code, formatted text
including equations, and graphs. You can choose among many formats to export a Jupyter Notebook.
Note that although we will use it for Python code only, many other languages like R or Julia are
supported.²

The Anaconda distribution of Python already comes with everything we need to create a Jupyter 
Notebook. You can also install it manually as explained on https://jupyter.org/. In the 
following, we introduce the interface of Jupyter Notebook and the two important building blocks: 
Code and Markdown cells. 


19.3.1. Getting Started 


You find Jupyter Notebook in the Anaconda Navigator, both of which were set up during the
installation of Anaconda. After clicking on the icon, your web browser opens and should look similar to
Figure 19.1. The figure also shows how to create a new Notebook: New → Python 3. This creates an
empty Notebook similar to the one in Figure 19.2.


19.3.2. Cells 


Let's start to enter some Python code into the displayed box starting with "In [ ]:" in Figure 19.2.
This box is referred to as a "cell" in a Jupyter Notebook and we choose 3**2 as an exemplary input
for such a cell in the upper screenshot in Figure 19.3. You can execute the code by clicking on the
run button in the toolbar and immediately inspect the output in the appearing box starting with
"Out [ ]:" (also shown in Figure 19.3). By default, Jupyter Notebook expects you to enter Python code
in a cell, which is also visualized by the toolbar field saying "Code". You can add more cells by
clicking on the plus button in the toolbar.

In the next step we create another cell and select Markdown in the drop-down menu in the toolbar.
We can now enter text and use Markdown commands to format it. The lower two screenshots of


¹ For more information, see Kluyver, Ragan-Kelley, Pérez, Granger, Bussonnier, Frederic, Kelley, Hamrick, Grout, Corlay,
Ivanov, Avila, Abdalla, Willing, and development team (2016).
² Actually, the name Jupyter is based on the three languages Julia, Python and R.
Figure 19.3 give an example. Here we use **some text** to print bold text and * to create a list
with bullet points. More useful Markdown commands are explained in the next subsection. After
entering the Markdown text, click on the run button to apply your formatting commands. Instead of
printing an output box, the cell you previously worked on is replaced by the formatted text. To edit
the cell later, just double click on it.

To export your Notebook use File → Download as and choose a format, for example formatted
HTML or PDF.

Figure 19.2. An Empty Jupyter Notebook


19.3.3. Markdown Basics 


Markdown cells include normal text, formatting instructions and LaTeX equations. There are countless
possibilities to create appealing Markdown cells. We can only give a few examples for the most
important formatting instructions:
* # Header 1, ## Header 2, and ### Header 3 produce different levels of headers.
* *word* prints the word in italics.
* **word** prints the word in bold.
* ``word`` prints the word in code-like typewriter font (obviously not for Python code
you want to execute).
* We can create lists with bullets using * at the beginning of a line followed by a whitespace.
* If you are familiar with LaTeX,³ inline and displayed formulas can be inserted using $...$ and
$$...$$, respectively, and the usual TeX syntax.

³ LaTeX is a powerful and free system for generating documents. In economics and other fields with a lot of maths involved,
it is widely used; in many areas, it is the de facto standard. It is also popular for typesetting articles and books. This
book is an example for a complex document created by LaTeX. At least basic knowledge of LaTeX is needed to follow the
equation related parts.

Figure 19.3. Cells in Jupyter Notebook

Different formatting options are demonstrated in the following Jupyter Notebook. It can be
downloaded in the .ipynb format from http://www.UPfIE.net. We start by showing you a collection
of all Code and Markdown cells we entered in our Jupyter Notebook:


File markdown-cell-1.txt
# Working with Jupyter Notebook

The following example is based on Script ``Descr-Figures`` from Chapter 2 and
demonstrates the use of **Jupyter Notebooks** to document your work step by step.
We will describe the two most important building blocks:

* basic Markdown commands to format your text in ``Markdown`` cells
* how to import and run Python code in ``Code`` cells

## Import and Prepare Data
Let's start by importing all external modules:

File code-cell-1.txt 


import wooldridge as woo 
import numpy as np 

import pandas as pd 

import matplotlib.pyplot as plt 


File markdown-cell-2.txt
In the next step, we import our data and define important variables:


File code-cell-2.txt
affairs = woo.dataWoo('affairs')

# use a pandas.Categorical object to attach labels:
affairs['haskids'] = pd.Categorical.from_codes(affairs['kids'],
                                               categories=['no', 'yes'])
counts = affairs['haskids'].value_counts()


File markdown-cell-3.txt
## Analyse Data
### View your Data
To get an overview you could use ``affairs.head()``.

### Calculate Descriptive Statistics
Up to this point, the code cells above produced no output.
This will change now, as we are interested in some results.
Let's start with printing out the average age. We start with
its definition and use LaTeX to enter the equation:

$$ \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i $$

The resulting Python code gives:


File code-cell-3.txt
age_mean = np.mean(affairs['age'])
print(age_mean)


File markdown-cell-4.txt
### Produce Graphic Results
In Chapter 2, we saw how to produce a pie chart. Let's repeat it here:


File code-cell-4.txt
plot = plt.pie(counts, labels=['no', 'yes'])


File markdown-cell-5.txt
You can also show Python code without executing it.
You can use ``inline code``, or for longer paragraphs
```python
plt.bar(['no', 'yes'], counts, color='dimgrey')
```


We exported the Jupyter Notebook into PDF and produced the following document: 
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Figure 19.4. Example of an Exported Jupyter Notebook 


jupyter-example 
May 14, 2020 


1 Working with Jupyter Notebook 


The following example is based on Script Descr-Figures from Chapter 2 and demonstrates the 
use of Jupyter Notebooks to document your work step by step. We will describe the two most. 
important building blocks: 


+ basic Markdown commands to format your text in Markdown cells 
* how to import and run Python code in Code cells 

1. Import and Prepare Data 

Let's start by importing all external modules: 


[1]: import wooldridge as woo 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 


In the next step, we import our data and define important variables: 
[2]: affairs = woo.dataWoo('affairs')
     # use a pandas.Categorical object to attach labels:
     affairs['haskids'] = pd.Categorical.from_codes(affairs['kids'],
                                                    categories=['no', 'yes'])
     counts = affairs['haskids'].value_counts()


1.2 Analyse Data

1.2.1 View your Data


To get an overview you could use affairs.head().


[3]:


1.2.2 Calculate Descriptive Statistics


Up to this point, the code cells above produced no output. This will change now, as we are interested
in some results. Let's start with printing out the average age. We start with its definition and use
LaTeX to enter the equation:

Figure 19.5. Example of an Exported Jupyter Notebook (cont'ed)


The resulting Python code gives:


[4]: age_mean = np.mean(affairs['age'])
     print(age_mean)


32.48752079866888


1.2.3 Produce Graphic Results

In Chapter 2, we saw how to produce a pie chart. Let's repeat it here:


[5]: plot = plt.pie(counts, labels=['no', 'yes'])


You can also show Python code without executing it. You can use inline code, or for longer
paragraphs


plt.bar(['no', 'yes'], counts, color='dimgrey')


Part IV. 


Appendices 


Python Scripts 


1. Scripts Used in Chapter 01 


Script 1.1: First-Python-Script.py
# This is a comment.
# in the next line, we try to enter Shakespeare:
'To be, or not to be: that is the question'
# let's try some sensible math:
print((1 + 2) * 5)
16 ** 0.5
print('\n')


Script 1.2: Python-as-a-Calculator.py 
result1 = 1 + 1
print(f'result1: {result1}\n')

result2 = 5 * (4 - 1) ** 2
print(f'result2: {result2}\n')

result3 = [result1, result2]
print(f'result3: \n{result3}\n')


Script 1.3: Module-Math.py 
import math as someAlias 


result1 = someAlias.sqrt(16)
print(f'result1: {result1}\n')

result2 = someAlias.pi
print(f'Pi: {result2}\n')

result3 = someAlias.e
print(f'Eulers number: {result3}\n')


Script 1.4: Objects-in-Python.py 
result1 = 1 + 1
# determine the type:
type_result1 = type(result1)
# print the result:
print(f'type_result1: {type_result1}')

result2 = 2.5
type_result2 = type(result2)
print(f'type_result2: {type_result2}')

result3 = 'To be, or not to be: that is the question'
type_result3 = type(result3)
print(f'type_result3: {type_result3}\n')


Script 1.5: Lists-Copy.py
# define a list:
example_list = [1, 5, 41.3, 2.0]

# be careful with changes on variables pointing on example_list:
duplicate_list = example_list
duplicate_list[3] = 10000
print(f'duplicate_list: {duplicate_list}\n')
print(f'example_list: {example_list}\n')

# work on a copy of example_list:
example_list = [1, 5, 41.3, 2.0]
duplicate_list = example_list[:]
duplicate_list[3] = 10000
print(f'duplicate_list: {duplicate_list}\n')
print(f'example_list: {example_list}\n')
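
A brief aside on the copy made with [:] in the script above: it is a shallow copy, so it protects the outer list but not mutable objects nested inside it. The following minimal sketch illustrates the difference; the names nested, alias, shallow, and deep are our own illustrations and do not appear in the book's scripts.

```python
import copy

nested = [[1, 5], [41.3, 2.0]]

alias = nested                 # a second name for the same list object
shallow = nested[:]            # new outer list, but the inner lists are shared
deep = copy.deepcopy(nested)   # fully independent copy

nested[0][1] = 10000           # change an element of an inner list
nested.append([0])             # change the outer list itself

print(alias is nested)         # the alias is the very same object
print(len(shallow))            # the outer list of the shallow copy is unaffected
print(shallow[0][1])           # ... but it still sees the inner change
print(deep[0][1])              # the deep copy keeps the old value
```

For flat lists of numbers, as in the script, the slice copy is sufficient; copy.deepcopy only matters once lists (or dicts) contain other mutable objects.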


Script 1.6: Lists.py 


# define a list:
example_list = [1, 5, 41.3, 2.0]
print(f'type(example_list): {type(example_list)}\n')

# access first entry by index:
first_entry = example_list[0]
print(f'first_entry: {first_entry}\n')

# access second to fourth entry by index:
range2to4 = example_list[1:4]
print(f'range2to4: {range2to4}\n')

# replace third entry by new value:
example_list[2] = 3
print(f'example_list: {example_list}\n')

# apply a function:
function_output = min(example_list)
print(f'function_output: {function_output}\n')

# apply a method:
example_list.sort()
print(f'example_list: {example_list}\n')

# delete third element of sorted list:
del example_list[2]
print(f'example_list: {example_list}\n')


Script 1.7: Dicts-Copy.py
# define and print a dict:
var1 = ['Florian', 'Daniel']
var2 = [96, 49]
var3 = [True, False]
example_dict = dict(name=var1, points=var2, passed=var3)
print(f'example_dict: \n{example_dict}\n')

# if you want to work on a copy:
import copy
copied_dict = copy.deepcopy(example_dict)
copied_dict['points'][1] = copied_dict['points'][1] - 40
print(f'example_dict: \n{example_dict}\n')
print(f'copied_dict: \n{copied_dict}\n')


Script 1.8: Dicts.py 


# define and print a dict:
var1 = ['Florian', 'Daniel']
var2 = [96, 49]
var3 = [True, False]
example_dict = dict(name=var1, points=var2, passed=var3)
print(f'example_dict: \n{example_dict}\n')

# another way to define the dict:
example_dict2 = {'name': var1, 'points': var2, 'passed': var3}
print(f'example_dict2: \n{example_dict2}\n')

# get data type:
print(f'type(example_dict): {type(example_dict)}\n')

# access 'points':
points_all = example_dict['points']
print(f'points_all: {points_all}\n')

# access 'points' of Daniel:
points_daniel = example_dict['points'][1]
print(f'points_daniel: {points_daniel}\n')

# add 4 to 'points' of Daniel and let him pass:
example_dict['points'][1] = example_dict['points'][1] + 4
example_dict['passed'][1] = True
print(f'example_dict: \n{example_dict}\n')

# add a new variable 'grade':
example_dict['grade'] = [1.3, 4.0]

# delete variable 'points':
del example_dict['points']
print(f'example_dict: \n{example_dict}\n')


Script 1.9: Numpy-Arrays.py 
import numpy as np 


# define arrays in numpy:
testarray1D = np.array([1, 5, 41.3, 2.0])
print(f'type(testarray1D): {type(testarray1D)}\n')

testarray2D = np.array([[4, 9, 8, 3],
                        [2, 6, 3, 2],
                        [1, 1, 7, 4]])

# get dimensions of testarray2D:
dim = testarray2D.shape
print(f'dim: {dim}\n')

# access elements by indices:
third_elem = testarray1D[2]
print(f'third_elem: {third_elem}\n')

second_third_elem = testarray2D[1, 2]  # element in 2nd row and 3rd column
print(f'second_third_elem: {second_third_elem}\n')

second_to_third_col = testarray2D[:, 1:3]  # each row in the 2nd and 3rd column
print(f'second_to_third_col: \n{second_to_third_col}\n')

# access elements by lists:
first_third_elem = testarray1D[[0, 2]]
print(f'first_third_elem: {first_third_elem}\n')

# same with Boolean lists:
first_third_elem2 = testarray1D[[True, False, True, False]]
print(f'first_third_elem2: {first_third_elem2}\n')

k = np.array([[True, False, False, False],
              [False, False, True, False],
              [True, False, True, False]])
elem_by_index = testarray2D[k]  # 1st elem in 1st row, 3rd elem in 2nd row...
print(f'elem_by_index: {elem_by_index}\n')


Script 1.10: Numpy-SpecialCases.py 
import numpy as np 


# array of equispaced values defined by the arguments start, end and sequence length:
sequence = np.linspace(0, 2, num=11)
print(f'sequence: \n{sequence}\n')

# sequence of integers starting at 0, ending at 5-1:
sequence_int = np.arange(5)
print(f'sequence_int: \n{sequence_int}\n')

# initialize array with each element set to zero:
zero_array = np.zeros((4, 3))
print(f'zero_array: \n{zero_array}\n')

# initialize array with each element set to one:
one_array = np.ones((2, 5))
print(f'one_array: \n{one_array}\n')

# uninitialized array (filled with arbitrary nonsense elements):
empty_array = np.empty((2, 3))
print(f'empty_array: \n{empty_array}\n')


Script 1.11: Numpy-Operations.py 
import numpy as np 


# define arrays in numpy:
mat1 = np.array([[4, 9, 8],
                 [2, 6, 3]])
mat2 = np.array([[1, 5, 2],
                 [6, 6, 0],
                 [4, 8, 3]])

# use a numpy function:
result1 = np.exp(mat1)
print(f'result1: \n{result1}\n')

result2 = mat1 + mat2[[0, 1]]  # same as np.add(mat1, mat2[[0, 1]])
print(f'result2: \n{result2}\n')

# use a method:
mat1_tr = mat1.transpose()
print(f'mat1_tr: \n{mat1_tr}\n')

# matrix algebra:
matprod = mat1.dot(mat2)  # same as mat1 @ mat2
print(f'matprod: \n{matprod}\n')


Script 1.12: Pandas.py
import numpy as np
import pandas as pd

# define a pandas DataFrame:
icecream_sales = np.array([30, 40, 35, 130, 120, 60])
weather_coded = np.array([0, 1, 0, 1, 1, 0])
customers = np.array([2000, 2100, 1500, 8000, 7200, 2000])
df = pd.DataFrame({'icecream_sales': icecream_sales,
                   'weather_coded': weather_coded,
                   'customers': customers})

# define and assign an index (six ends of month starting in April, 2010)
# (details on generating indices are given in Chapter 10):
ourIndex = pd.date_range(start='04/2010', freq='M', periods=6)
df.set_index(ourIndex, inplace=True)

# print the DataFrame
print(f'df: \n{df}\n')

# access columns by variable names:
subset1 = df[['icecream_sales', 'customers']]
print(f'subset1: \n{subset1}\n')

# access second to fourth row:
subset2 = df[1:4]  # same as df['2010-05-31':'2010-07-31']
print(f'subset2: \n{subset2}\n')

# access rows and columns by index and variable names:
subset3 = df.loc['2010-05-31', 'customers']  # same as df.iloc[1,2]
print(f'subset3: \n{subset3}\n')

# access rows and columns by index and variable integer positions:
subset4 = df.iloc[1:4, 0:2]
# same as df.loc['2010-05-31':'2010-07-31', ['icecream_sales','weather_coded']]
print(f'subset4: \n{subset4}\n')


Script 1.13: Pandas-Operations.py 
import numpy as np 
import pandas as pd 


# define a pandas DataFrame:
icecream_sales = np.array([30, 40, 35, 130, 120, 60])
weather_coded = np.array([0, 1, 0, 1, 1, 0])
customers = np.array([2000, 2100, 1500, 8000, 7200, 2000])
df = pd.DataFrame({'icecream_sales': icecream_sales,
                   'weather_coded': weather_coded,
                   'customers': customers})

# define and assign an index (six ends of month starting in April, 2010)
# (details on generating indices are given in Chapter 10):
ourIndex = pd.date_range(start='04/2010', freq='M', periods=6)
df.set_index(ourIndex, inplace=True)

# include sales two months ago:
df['icecream_sales_lag2'] = df['icecream_sales'].shift(2)
print(f'df: \n{df}\n')

# use a pandas.Categorical object to attach labels (0 = bad; 1 = good):
df['weather'] = pd.Categorical.from_codes(codes=df['weather_coded'],
                                          categories=['bad', 'good'])
print(f'df: \n{df}\n')

# mean sales for each weather category:
group_means = df.groupby('weather').mean()
print(f'group_means: \n{group_means}\n')


Script 1.14: Wooldridge.py 
import wooldridge as woo 


# load data: 
wage1 = woo.dataWoo('wage1')

# get type:
print(f'type(wage1): \n{type(wage1)}\n')

# get an overview:
print(f'wage1.head(): \n{wage1.head()}\n')


Script 1.15: Import-Export.py 
import pandas as pd 


# import csv with pandas:
df1 = pd.read_csv('data/sales.csv', delimiter=',', header=None,
                  names=['year', 'product1', 'product2', 'product3'])
print(f'df1: \n{df1}\n')

# import txt with pandas:
df2 = pd.read_table('data/sales.txt', delimiter=' ')
print(f'df2: \n{df2}\n')

# add a row to df1
# (note: DataFrame.append was removed in pandas 2.0; pd.concat replaces it there):
df3 = df1.append({'year': 2014, 'product1': 10, 'product2': 8, 'product3': 2},
                 ignore_index=True)
print(f'df3: \n{df3}\n')

# export with pandas:
df3.to_csv('data/sales2.csv')


Script 1.16: Import-StockData.py
import pandas_datareader as pdr

# download data for 'F' (= Ford Motor Company) and define start and end:
tickers = ['F']
start_date = '2014-01-01'
end_date = '2015-12-31'

# use pandas_datareader for the import:
F_data = pdr.data.DataReader(tickers, 'yahoo', start_date, end_date)

# look at imported data:
print(f'F_data.head(): \n{F_data.head()}\n')
print(f'F_data.tail(): \n{F_data.tail()}\n')


Script 1.17: Graphs-Basics.py
import matplotlib.pyplot as plt

# create data:
x = [1, 3, 4, 7, 8, 9]
y = [0, 3, 6, 9, 7, 8]

# plot and save:
plt.plot(x, y, color='black')
plt.savefig('PyGraphs/Graphs-Basics-a.pdf')
plt.close()


Script 1.18: Graphs-Basics2.py
import matplotlib.pyplot as plt

# create data:
x = [1, 3, 4, 7, 8, 9]
y = [0, 3, 6, 9, 7, 8]

# plot and save:
plt.plot(x, y, color='black', linestyle='--')
plt.savefig('PyGraphs/Graphs-Basics-b.pdf')
plt.close()

plt.plot(x, y, color='black', linestyle=':')
plt.savefig('PyGraphs/Graphs-Basics-c.pdf')
plt.close()

plt.plot(x, y, color='black', linestyle='-', linewidth=3)
plt.savefig('PyGraphs/Graphs-Basics-d.pdf')
plt.close()

plt.plot(x, y, color='black', marker='o')
plt.savefig('PyGraphs/Graphs-Basics-e.pdf')
plt.close()

plt.plot(x, y, color='black', marker='v', linestyle='')
plt.savefig('PyGraphs/Graphs-Basics-f.pdf')


Script 1.19: Graphs-Functions.py
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

# support of quadratic function
# (creates an array with 100 equispaced elements from -3 to 2):
x1 = np.linspace(-3, 2, num=100)
# function values for all these values:
y1 = x1 ** 2

# plot quadratic function:
plt.plot(x1, y1, linestyle='-', color='black')
plt.savefig('PyGraphs/Graphs-Functions-a.pdf')
plt.close()

# same for normal density:
x2 = np.linspace(-4, 4, num=100)
y2 = stats.norm.pdf(x2)

# plot normal density:
plt.plot(x2, y2, linestyle='-', color='black')
plt.savefig('PyGraphs/Graphs-Functions-b.pdf')


Script 1.20: Graphs-BuildingBlocks.py
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

# support for all normal densities:
x = np.linspace(-4, 4, num=100)
# get different density evaluations:
y1 = stats.norm.pdf(x, 0, 1)
y2 = stats.norm.pdf(x, 1, 0.5)
y3 = stats.norm.pdf(x, 0, 2)

# plot:
plt.plot(x, y1, linestyle='-', color='black', label='standard normal')
plt.plot(x, y2, linestyle='--', color='0.3', label='mu = 1, sigma = 0.5')
plt.plot(x, y3, linestyle=':', color='0.6', label='$\mu = 0$, $\sigma = 2$')
plt.title('Normal Densities')
plt.ylabel('$\phi(x)$')
plt.xlabel('x')
plt.legend()
plt.savefig('PyGraphs/Graphs-BuildingBlocks.pdf')


Script 1.21: Graphs-Export.py
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

# support for all normal densities:
x = np.linspace(-4, 4, num=100)
# get different density evaluations:
y1 = stats.norm.pdf(x, 0, 1)
y2 = stats.norm.pdf(x, 0, 3)

# plot (a):
plt.figure(figsize=(4, 6))
plt.plot(x, y1, linestyle='-', color='black')
plt.plot(x, y2, linestyle='--', color='0.3')
plt.savefig('PyGraphs/Graphs-Export-a.pdf')
plt.close()

# plot (b):
plt.figure(figsize=(6, 4))
plt.plot(x, y1, linestyle='-', color='black')
plt.plot(x, y2, linestyle='--', color='0.3')
plt.savefig('PyGraphs/Graphs-Export-b.png')


Script 1.22: Descr-Tables.py
import wooldridge as woo
import numpy as np
import pandas as pd

affairs = woo.dataWoo('affairs')

# adjust codings to [0-4] (Categoricals require a start from 0):
affairs['ratemarr'] = affairs['ratemarr'] - 1

# use a pandas.Categorical object to attach labels for "haskids":
affairs['haskids'] = pd.Categorical.from_codes(affairs['kids'],
                                               categories=['no', 'yes'])
# ... and "marriage" (for example: 0 = 'very unhappy', 1 = 'unhappy',...):
mlab = ['very unhappy', 'unhappy', 'average', 'happy', 'very happy']
affairs['marriage'] = pd.Categorical.from_codes(affairs['ratemarr'],
                                                categories=mlab)

# frequency table in numpy (alphabetical order of elements):
ft_np = np.unique(affairs['marriage'], return_counts=True)
unique_elem_np = ft_np[0]
counts_np = ft_np[1]
print(f'unique_elem_np: \n{unique_elem_np}\n')
print(f'counts_np: \n{counts_np}\n')

# frequency table in pandas:
ft_pd = affairs['marriage'].value_counts()
print(f'ft_pd: \n{ft_pd}\n')

# frequency table with groupby:
ft_pd2 = affairs['marriage'].groupby(affairs['haskids']).value_counts()
print(f'ft_pd2: \n{ft_pd2}\n')

# contingency table in pandas:
ct_all_abs = pd.crosstab(affairs['marriage'], affairs['haskids'], margins=True)
print(f'ct_all_abs: \n{ct_all_abs}\n')
ct_all_rel = pd.crosstab(affairs['marriage'], affairs['haskids'], normalize='all')
print(f'ct_all_rel: \n{ct_all_rel}\n')

# share within "marriage" (i.e. within a row):
ct_row = pd.crosstab(affairs['marriage'], affairs['haskids'], normalize='index')
print(f'ct_row: \n{ct_row}\n')

# share within "haskids" (i.e. within a column):
ct_col = pd.crosstab(affairs['marriage'], affairs['haskids'], normalize='columns')
print(f'ct_col: \n{ct_col}\n')


Script 1.23: Descr-Figures.py
import wooldridge as woo
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

affairs = woo.dataWoo('affairs')

# attach labels (see previous script):
affairs['ratemarr'] = affairs['ratemarr'] - 1
affairs['haskids'] = pd.Categorical.from_codes(affairs['kids'],
                                               categories=['no', 'yes'])
mlab = ['very unhappy', 'unhappy', 'average', 'happy', 'very happy']
affairs['marriage'] = pd.Categorical.from_codes(affairs['ratemarr'],
                                                categories=mlab)

# counts for all graphs:
counts = affairs['marriage'].value_counts()
counts_bykids = affairs['marriage'].groupby(affairs['haskids']).value_counts()
counts_yes = counts_bykids['yes']
counts_no = counts_bykids['no']

# pie chart (a):
grey_colors = ['0.3', '0.4', '0.5', '0.6', '0.7']
plt.pie(counts, labels=mlab, colors=grey_colors)
plt.savefig('PyGraphs/Descr-Pie.pdf')
plt.close()

# horizontal bar chart (b):
y_pos = [0, 1, 2, 3, 4]  # the y locations for the bars
plt.barh(y_pos, counts, color='0.6')
plt.yticks(y_pos, mlab, rotation=60)  # add and adjust labeling
plt.savefig('PyGraphs/Descr-Bar1.pdf')
plt.close()

# stacked bar plot (c):
x_pos = [0, 1, 2, 3, 4]  # the x locations for the bars
plt.bar(x_pos, counts_yes, width=0.4, color='0.6', label='Yes')
# with 'bottom=counts_yes' bars are added on top of previous ones:
plt.bar(x_pos, counts_no, width=0.4, bottom=counts_yes, color='0.3', label='No')
plt.ylabel('Counts')
plt.xticks(x_pos, mlab)  # add labels on x axis
plt.legend()
plt.savefig('PyGraphs/Descr-Bar2.pdf')
plt.close()

# grouped bar plot (d)
# add left bars first and move bars to the left:
x_pos_leftbar = [-0.2, 0.8, 1.8, 2.8, 3.8]
plt.bar(x_pos_leftbar, counts_yes, width=0.4, color='0.6', label='Yes')
# add right bars and move bars to the right:
x_pos_rightbar = [0.2, 1.2, 2.2, 3.2, 4.2]
plt.bar(x_pos_rightbar, counts_no, width=0.4, color='0.3', label='No')
plt.ylabel('Counts')
plt.xticks(x_pos, mlab)
plt.legend()
plt.savefig('PyGraphs/Descr-Bar3.pdf')


Script 1.24: Histogram.py 
import wooldridge as woo 
import matplotlib.pyplot as plt 


ceosal1 = woo.dataWoo('ceosal1')

# extract roe:
roe = ceosal1['roe']




# subfigure a (histogram with counts):
plt.hist(roe, color='grey')
plt.ylabel('Counts')
plt.xlabel('roe')
plt.savefig('PyGraphs/Histogram1.pdf')
plt.close()

# subfigure b (histogram with density and explicit breaks):
breaks = [0, 5, 10, 20, 30, 60]
plt.hist(roe, color='grey', bins=breaks, density=True)
plt.ylabel('density')
plt.xlabel('roe')
plt.savefig('PyGraphs/Histogram2.pdf')


Script 1.25: KDensity.py 


import wooldridge as woo 
import statsmodels.api as sm 
import matplotlib.pyplot as plt 


ceosal1 = woo.dataWoo('ceosal1')

# extract roe:
roe = ceosal1['roe']

# estimate kernel density:
kde = sm.nonparametric.KDEUnivariate(roe)
kde.fit()

# subfigure a (kernel density):
plt.plot(kde.support, kde.density, color='black', linewidth=2)
plt.ylabel('density')
plt.xlabel('roe')
plt.savefig('PyGraphs/Density1.pdf')
plt.close()

# subfigure b (kernel density with overlayed histogram):
plt.hist(roe, color='grey', density=True)
plt.plot(kde.support, kde.density, color='black', linewidth=2)
plt.ylabel('density')
plt.xlabel('roe')
plt.savefig('PyGraphs/Density2.pdf')


Script 1.26: Descr-ECDF.py 


import wooldridge as woo 
import numpy as np 
import matplotlib.pyplot as plt 


ceosal1 = woo.dataWoo('ceosal1')

# extract roe:
roe = ceosal1['roe']

# calculate ECDF:
x = np.sort(roe)
n = x.size
y = np.arange(1, n + 1) / n  # generates cumulative shares of observations

# plot a step function:
plt.step(x, y, linestyle='-', color='black')
plt.xlabel('roe')
plt.savefig('PyGraphs/ecdf.pdf')
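
The ECDF calculation in the script above can be checked by hand on a few numbers. The following is a minimal sketch in plain Python, assuming a hypothetical helper named ecdf and made-up toy data (neither is part of the book's scripts or of the ceosal1 data set): after sorting, the i-th smallest observation is assigned the cumulative share i/n.

```python
def ecdf(values):
    """Return the sorted values and their cumulative shares i/n, i = 1..n."""
    x = sorted(values)
    n = len(x)
    y = [(i + 1) / n for i in range(n)]
    return x, y


# hypothetical toy data, chosen only to make the shares easy to verify
toy = [12.0, 3.5, 7.0, 9.5]
x, y = ecdf(toy)
for xi, yi in zip(x, y):
    print(xi, yi)
```

With four observations the shares are 0.25, 0.5, 0.75, and 1.0, which is what np.arange(1, n + 1) / n produces for n = 4.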


Script 1.27: Descr-Stats.py
import wooldridge as woo
import numpy as np

ceosal1 = woo.dataWoo('ceosal1')

# extract roe and salary:
roe = ceosal1['roe']
salary = ceosal1['salary']

# sample average:
roe_mean = np.mean(salary)
print(f'roe_mean: {roe_mean}\n')

# sample median:
roe_med = np.median(salary)
print(f'roe_med: {roe_med}\n')

# standard deviation:
roe_s = np.std(salary, ddof=1)
print(f'roe_s: {roe_s}\n')

# correlation with ROE:
roe_corr = np.corrcoef(roe, salary)
print(f'roe_corr: \n{roe_corr}\n')
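
The argument ddof=1 in np.std requests the sample standard deviation with n-1 in the denominator, whereas numpy's default ddof=0 divides by n. As a hedged illustration, the following sketch spells out both versions using only the standard library; the toy data and variable names are our own assumptions.

```python
import math
import statistics

# hypothetical toy data, chosen only so the arithmetic is easy to follow
data = [2.0, 4.0, 4.0, 6.0]
n = len(data)
mean = sum(data) / n

ss = sum((v - mean) ** 2 for v in data)   # sum of squared deviations
sd_sample = math.sqrt(ss / (n - 1))       # corresponds to ddof=1
sd_pop = math.sqrt(ss / n)                # corresponds to ddof=0

print(sd_sample, statistics.stdev(data))  # both use the n-1 denominator
print(sd_pop, statistics.pstdev(data))    # both use the n denominator
```

For large samples the two versions are nearly identical; the distinction matters mostly in small samples such as the 20 observations of Example C.2.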


Script 1.28: Descr-Boxplot.py 
import wooldridge as woo 
import matplotlib.pyplot as plt 


ceosal1 = woo.dataWoo('ceosal1')

# extract roe and consprod:
roe = ceosal1['roe']
consprod = ceosal1['consprod']

# plotting descriptive statistics:
plt.boxplot(roe, vert=False)
plt.ylabel('roe')
plt.savefig('PyGraphs/Boxplot1.pdf')
plt.close()

# plotting descriptive statistics:
roe_cp0 = roe[consprod == 0]
roe_cp1 = roe[consprod == 1]

plt.boxplot([roe_cp0, roe_cp1])
plt.ylabel('roe')
plt.savefig('PyGraphs/Boxplot2.pdf')


Script 1.29: PMF-binom.py
import scipy.stats as stats
import math

# pedestrian approach:
c = math.factorial(10) / (math.factorial(2) * math.factorial(10 - 2))
p1 = c * (0.2 ** 2) * (0.8 ** 8)
print(f'p1: {p1}\n')

# scipy function:
p2 = stats.binom.pmf(2, 10, 0.2)
print(f'p2: {p2}\n')
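
The pedestrian approach generalizes to any x, n, and p. The following minimal sketch assumes a hypothetical helper binom_pmf (our own name) and uses math.comb, available from Python 3.8, for the binomial coefficient instead of the ratio of factorials.

```python
import math


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p), written out from the formula."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


p1 = binom_pmf(2, 10, 0.2)
total = sum(binom_pmf(x, 10, 0.2) for x in range(11))
print(p1)      # probability of exactly 2 successes in 10 trials
print(total)   # the probabilities over x = 0, ..., 10 sum to one
```

Summing the PMF over the whole support is a quick sanity check that the formula has been entered correctly.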


Script 1.30: PMF-example.py
import scipy.stats as stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# values for x (all between 0 and 10):
x = np.linspace(0, 10, num=11)

# PMF for all these values:
fx = stats.binom.pmf(x, 10, 0.2)

# collect values in DataFrame:
result = pd.DataFrame({'x': x, 'fx': fx})
print(f'result: \n{result}\n')

# plot:
plt.bar(x, fx, color='0.6')
plt.xlabel('x')
plt.ylabel('fx')
plt.savefig('PyGraphs/PMF-example.pdf')


Script 1.31: PDF-example.py 
import scipy.stats as stats 


import numpy as np 
import matplotlib.pyplot as plt 


# support of normal density:
x_range = np.linspace(-4, 4, num=100)

# PDF for all these values:
pdf = stats.norm.pdf(x_range)

# plot:
plt.plot(x_range, pdf, linestyle='-', color='black')
plt.xlabel('x')
plt.ylabel('dx')
plt.savefig('PyGraphs/PDF-example.pdf')


Script 1.32: CDF-example.py
import scipy.stats as stats

# binomial CDF:
p1 = stats.binom.cdf(3, 10, 0.2)
print(f'p1: {p1}\n')

# normal CDF:
p2 = stats.norm.cdf(1.96) - stats.norm.cdf(-1.96)
print(f'p2: {p2}\n')


Script 1.33: Example-B-6.py
import scipy.stats as stats

# first example using the transformation:
p1_1 = stats.norm.cdf(2 / 3) - stats.norm.cdf(-2 / 3)
print(f'p1_1: {p1_1}\n')

# first example working directly with the distribution of X:
p1_2 = stats.norm.cdf(6, 4, 3) - stats.norm.cdf(2, 4, 3)
print(f'p1_2: {p1_2}\n')

# second example:
p2 = 1 - stats.norm.cdf(2, 4, 3) + stats.norm.cdf(-2, 4, 3)
print(f'p2: {p2}\n')
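
The two routes to the first probability rest on the standardization Z = (X - 4) / 3 for X distributed as Normal(4, 9). As a cross-check that does not depend on scipy, the following sketch repeats the calculation with statistics.NormalDist from the standard library; the variable names are our own.

```python
from statistics import NormalDist

Z = NormalDist()                 # standard normal
X = NormalDist(mu=4, sigma=3)    # mean 4, standard deviation 3

# P(2 < X < 6), computed directly and after standardizing
direct = X.cdf(6) - X.cdf(2)
standardized = Z.cdf((6 - 4) / 3) - Z.cdf((2 - 4) / 3)

# P(|X| > 2) = 1 - P(X <= 2) + P(X <= -2)
p2 = 1 - X.cdf(2) + X.cdf(-2)

print(direct, standardized, p2)
```

Both versions of the first probability agree up to floating point error, which is the point of the transformation.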


Script 1.34: CDF-figure.py 
import scipy.stats as stats 
import numpy as np 
import matplotlib.pyplot as plt 


# binomial:
# support of binomial PMF:
x_binom = np.linspace(-1, 10, num=1000)

# CDF for all these values:
cdf_binom = stats.binom.cdf(x_binom, 10, 0.2)

# plot:
plt.step(x_binom, cdf_binom, linestyle='-', color='black')
plt.xlabel('x')
plt.ylabel('Fx')
plt.savefig('PyGraphs/CDF-figure-discrete.pdf')
plt.close()

# normal:
# support of normal density:
x_norm = np.linspace(-4, 4, num=1000)

# CDF for all these values:
cdf_norm = stats.norm.cdf(x_norm)

# plot:
plt.plot(x_norm, cdf_norm, linestyle='-', color='black')
plt.xlabel('x')
plt.ylabel('Fx')
plt.savefig('PyGraphs/CDF-figure-cont.pdf')


Script 1.35: Quantile-example.py
import scipy.stats as stats

q_975 = stats.norm.ppf(0.975)
print(f'q_975: {q_975}\n')


Script 1.36: smpl-bernoulli.py
import scipy.stats as stats

sample = stats.bernoulli.rvs(0.5, size=10)
print(f'sample: {sample}\n')


Script 1.37: smpl-norm.py
import scipy.stats as stats

sample = stats.norm.rvs(size=10)
print(f'sample: {sample}\n')


Script 1.38: Random-Numbers.py 
import numpy as np 
import scipy.stats as stats 


# sample from a standard normal RV with sample size n=5:
sample1 = stats.norm.rvs(size=5)
print(f'sample1: {sample1}\n')

# a different sample from the same distribution:
sample2 = stats.norm.rvs(size=5)
print(f'sample2: {sample2}\n')

# set the seed of the random number generator and take two samples:
np.random.seed(6254137)
sample3 = stats.norm.rvs(size=5)
print(f'sample3: {sample3}\n')

sample4 = stats.norm.rvs(size=5)
print(f'sample4: {sample4}\n')

# reset the seed to the same value to get the same samples again:
np.random.seed(6254137)
sample5 = stats.norm.rvs(size=5)
print(f'sample5: {sample5}\n')

sample6 = stats.norm.rvs(size=5)
print(f'sample6: {sample6}\n')
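
The same reproducibility idea can be sketched without numpy's global seed. The example below is an illustration under our own assumptions: it uses the standard library's random.Random class rather than the scipy/numpy calls of the script, and the helper name draw is hypothetical. Each call creates a private generator, so the result depends only on the seed passed in.

```python
import random


def draw(seed, size=5):
    """Draw `size` standard normal values from a generator seeded with `seed`."""
    rng = random.Random(seed)      # private generator, independent of global state
    return [rng.gauss(0, 1) for _ in range(size)]


first = draw(6254137)
second = draw(6254137)   # same seed: identical draws
other = draw(1)          # different seed: a different sample

print(first == second)
print(first == other)
```

The numbers differ from those produced by stats.norm.rvs with the same seed, since the two libraries use different generators; only the principle carries over.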


Script 1.39: Example-C-2.py 
import numpy as np 
import scipy.stats as stats 


# manually enter raw data from Wooldridge, Table C.3:
SR87 = np.array([10, 1, 6, .45, 1.25, 1.3, 1.06, 3, 8.18, 1.67,
                 .98, 1, .45, 5.03, 8, 9, 18, .28, 7, 3.97])
SR88 = np.array([3, 1, 5, .5, 1.54, 1.5, .8, 2, .67, 1.17, .51,
                 .5, .61, 6.7, 4, 7, 19, .2, 5, 3.83])

# calculate change:
Change = SR88 - SR87

# ingredients to CI formula:
avgCh = np.mean(Change)
print(f'avgCh: {avgCh}\n')

n = len(Change)
sdCh = np.std(Change, ddof=1)
se = sdCh / np.sqrt(n)
print(f'se: {se}\n')

c = stats.t.ppf(0.975, n - 1)
print(f'c: {c}\n')

# confidence interval:
lowerCI = avgCh - c * se
print(f'lowerCI: {lowerCI}\n')

upperCI = avgCh + c * se
print(f'upperCI: {upperCI}\n')


Script 1.40: Example-C-3.py
import wooldridge as woo
import numpy as np
import scipy.stats as stats

audit = woo.dataWoo('audit')
y = audit['y']

# ingredients to CI formula:
avgy = np.mean(y)
n = len(y)
sdy = np.std(y, ddof=1)
se = sdy / np.sqrt(n)
c95 = stats.norm.ppf(0.975)
c99 = stats.norm.ppf(0.995)

# 95% confidence interval:
lowerCI95 = avgy - c95 * se
print(f'lowerCI95: {lowerCI95}\n')

upperCI95 = avgy + c95 * se
print(f'upperCI95: {upperCI95}\n')

# 99% confidence interval:
lowerCI99 = avgy - c99 * se
print(f'lowerCI99: {lowerCI99}\n')

upperCI99 = avgy + c99 * se
print(f'upperCI99: {upperCI99}\n')
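
The steps of the script above (sample average, standard error, normal critical value, interval bounds) can be wrapped into one function. The following is a minimal sketch using only the standard library; the function name normal_ci and the 0/1 toy data are our own assumptions and are not the audit data.

```python
import math
from statistics import NormalDist, mean, stdev


def normal_ci(y, level=0.95):
    """Approximate CI for the mean: ybar -/+ c * se with a normal critical value."""
    n = len(y)
    se = stdev(y) / math.sqrt(n)                    # stdev uses the n-1 denominator
    c = NormalDist().inv_cdf(1 - (1 - level) / 2)   # e.g. about 1.96 for 95%
    return mean(y) - c * se, mean(y) + c * se


# hypothetical 0/1 outcomes (30 ones among 100 observations)
y = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0] * 10
lo95, hi95 = normal_ci(y, 0.95)
lo99, hi99 = normal_ci(y, 0.99)
print(lo95, hi95)
print(lo99, hi99)
```

The 99% interval is wider than the 95% interval because its critical value is larger, exactly as with c95 and c99 in the script.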


Script 1.41: Critical-Values-t.py 
import numpy as np 
import pandas as pd 
import scipy.stats as stats 


# degrees of freedom = n-1:
df = 19

# significance levels:
alpha_one_tailed = np.array([0.1, 0.05, 0.025, 0.01, 0.005, .001])
alpha_two_tailed = alpha_one_tailed * 2

# critical values & table:
CV = stats.t.ppf(1 - alpha_one_tailed, df)
table = pd.DataFrame({'alpha_one_tailed': alpha_one_tailed,
                      'alpha_two_tailed': alpha_two_tailed, 'CV': CV})
print(f'table: \n{table}\n')


Script 1.42: Example-C-5.py
import wooldridge as woo
import numpy as np
import pandas as pd
import scipy.stats as stats

audit = woo.dataWoo('audit')
y = audit['y']

# automated calculation of t statistic for H0 (mu=0):
test_auto = stats.ttest_1samp(y, popmean=0)
t_auto = test_auto.statistic  # access test statistic
p_auto = test_auto.pvalue  # access two-sided p value
print(f't_auto: {t_auto}\n')
print(f'p_auto/2: {p_auto / 2}\n')

# manual calculation of t statistic for H0 (mu=0):
avgy = np.mean(y)
n = len(y)
sdy = np.std(y, ddof=1)
se = sdy / np.sqrt(n)
t_manual = avgy / se
print(f't_manual: {t_manual}\n')

# critical values for t distribution with n-1=240 d.f.:
alpha_one_tailed = np.array([0.1, 0.05, 0.025, 0.01, 0.005, .001])
CV = stats.t.ppf(1 - alpha_one_tailed, 240)
table = pd.DataFrame({'alpha_one_tailed': alpha_one_tailed, 'CV': CV})
print(f'table: \n{table}\n')


Script 1.43: Example-C-6.py 
import numpy as np 
import scipy.stats as stats 


# manually enter raw data from Wooldridge, Table C.3:
SR87 = np.array([10, 1, 6, .45, 1.25, 1.3, 1.06, 3, 8.18, 1.67,
                 .98, 1, .45, 5.03, 8, 9, 18, .28, 7, 3.97])
SR88 = np.array([3, 1, 5, .5, 1.54, 1.5, .8, 2, .67, 1.17, .51,
                 .5, .61, 6.7, 4, 7, 19, .2, 5, 3.83])
Change = SR88 - SR87

# automated calculation of t statistic for H0 (mu=0):
test_auto = stats.ttest_1samp(Change, popmean=0)
t_auto = test_auto.statistic
p_auto = test_auto.pvalue
print(f't_auto: {t_auto}\n')
print(f'p_auto/2: {p_auto / 2}\n')

# manual calculation of t statistic for H0 (mu=0):
avgCh = np.mean(Change)
n = len(Change)
sdCh = np.std(Change, ddof=1)
se = sdCh / np.sqrt(n)
t_manual = avgCh / se
print(f't_manual: {t_manual}\n')

# manual calculation of p value for H0 (mu=0):
p_manual = stats.t.cdf(t_manual, n - 1)
print(f'p_manual: {p_manual}\n')


Script 1.44: Example-C-7.py 
import wooldridge as woo 
import numpy as np 

import pandas as pd 

import scipy.stats as stats 


audit = woo.dataWoo('audit')
y = audit['y']

# automated calculation of t statistic for H0 (mu=0):
test_auto = stats.ttest_1samp(y, popmean=0)
t_auto = test_auto.statistic
p_auto = test_auto.pvalue
print(f't_auto: {t_auto}\n')
print(f'p_auto/2: {p_auto / 2}\n')

# manual calculation of t statistic for H0 (mu=0):
avgy = np.mean(y)
n = len(y)
sdy = np.std(y, ddof=1)
se = sdy / np.sqrt(n)
t_manual = avgy / se
print(f't_manual: {t_manual}\n')

# manual calculation of p value for H0 (mu=0):
p_manual = stats.t.cdf(t_manual, n - 1)
print(f'p_manual: {p_manual}\n')


Script 1.45: Adv-Loops.py
seq = [1, 2, 3, 4, 5, 6]
for i in seq:
    if i < 4:
        print(i ** 3)
    else:
        print(i ** 2)


Script 1.46: Adv-Loops2.py 
seq = [1, 2, 3, 4, 5, 6]
for i in range(len(seq)):
    if seq[i] < 4:
        print(seq[i] ** 3)
    else:
        print(seq[i] ** 2)
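
When the results of such a loop should be stored rather than printed, a list comprehension with a conditional expression is a compact alternative. The following sketch applies the same rule as the two loop scripts (cube values below 4, square the others); the name result is our own.

```python
seq = [1, 2, 3, 4, 5, 6]

# cube values below 4, square the others, and collect the results in a list
result = [i ** 3 if i < 4 else i ** 2 for i in seq]
print(result)   # [1, 8, 27, 16, 25, 36]

# enumerate yields position and value together when both are needed
for pos, value in enumerate(seq):
    print(pos, value)
```

Looping over range(len(seq)) as in the second script is mainly useful when the position itself is needed; enumerate provides it without indexing into the list.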


Script 1.47: Adv-Functions.py
# define function:
def mysqrt(x):
    if x >= 0:
        result = x ** 0.5
    else:
        result = 'You fool!'
    return result


# call function and save result:
result1 = mysqrt(4)
print(f'result1: {result1}\n')

result2 = mysqrt(-1.5)
print(f'result2: {result2}\n')


Script 1.48: Adv-ObjOr.py
# use the predefined class 'list' to create an object:
a = [2, 6, 3, 6]

# access a local variable (to find out what kind of object we are dealing with):
check = type(a).__name__
print(f'check: {check}\n')

# make use of a method (how many 6 are in a?):
count_six = a.count(6)
print(f'count_six: {count_six}\n')

# use another method (sort data in a):
a.sort()
print(f'a: {a}\n')


Script 1.49: Adv-ObjOr2.py
import numpy as np

# multiply these two matrices:
a = np.array([[3, 6, 1], [2, 7, 4]])
b = np.array([[1, 8, 6], [3, 5, 8], [1, 1, 2]])

# the numpy way:
result_np = a.dot(b)
print(f'result_np: \n{result_np}\n')


# or, do it yourself by defining a class:
class myMatrices:
    def __init__(self, A, B):
        self.A = A
        self.B = B

    def mult(self):
        N = self.A.shape[0]  # number of rows in A
        K = self.B.shape[1]  # number of cols in B
        out = np.empty((N, K))  # initialize output
        for i in range(N):
            for j in range(K):
                out[i, j] = sum(self.A[i, :] * self.B[:, j])
        return out


# create an object:
test = myMatrices(a, b)

# access local variables:
print(f'test.A: \n{test.A}\n')
print(f'test.B: \n{test.B}\n')

# use object method:
result_own = test.mult()
print(f'result_own: \n{result_own}\n')
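
The double loop in mult computes each output element as the inner product of a row of A and a column of B. As a hedged cross-check that needs no numpy, the following sketch does the same with nested lists; the function name matmul is our own and the matrices are the ones from the script.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    if len(A[0]) != len(B):
        raise ValueError('number of columns of A must equal number of rows of B')
    cols_B = list(zip(*B))   # transpose B to obtain its columns
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B]
            for row in A]


a = [[3, 6, 1], [2, 7, 4]]
b = [[1, 8, 6], [3, 5, 8], [1, 1, 2]]
print(matmul(a, b))   # [[22, 55, 68], [27, 55, 76]]
```

The first entry, for example, is 3*1 + 6*3 + 1*1 = 22. In practice, a.dot(b) or a @ b is preferable; the hand-written version only makes the definition explicit.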


Script 1.50: Adv-ObjOr3.py 
import numpy as np 


# multiply these two matrices: 
a = np.array([[3, 6, 1], [2, 7, 4]]) 
b = np.array([[1, 8, 6], [3, 5, 8], [1, 1, 2]]) 


# define your own class:
class myMatrices:
    def __init__(self, A, B):
        self.A = A
        self.B = B

    def mult(self):
        N = self.A.shape[0]  # number of rows in A
        K = self.B.shape[1]  # number of cols in B
        out = np.empty((N, K))  # initialize output
        for i in range(N):
            for j in range(K):
                out[i, j] = sum(self.A[i, :] * self.B[:, j])
        return out


# define a subclass:
class myMatNew(myMatrices):
    def getTotalElem(self):
        N = self.A.shape[0]  # number of rows in A
        K = self.B.shape[1]  # number of cols in B
        return N * K


# create an object of the subclass:
test = myMatNew(a, b)

# use a method of myMatrices:
result_own = test.mult()
print(f'result_own: \n{result_own}\n')

# use a method of myMatNew:
totalElem = test.getTotalElem()
print(f'totalElem: {totalElem}\n')


Script 1.51: Simulate-Estimate.py
import numpy as np
import scipy.stats as stats

# set the random seed:
np.random.seed(123456)

# set sample size:
n = 100

# draw a sample given the population parameters:
sample1 = stats.norm.rvs(10, 2, size=n)

# estimate the population mean with the sample average:
estimate1 = np.mean(sample1)
print(f'estimate1: {estimate1}\n')

# draw a different sample and estimate again:
sample2 = stats.norm.rvs(10, 2, size=n)
estimate2 = np.mean(sample2)
print(f'estimate2: {estimate2}\n')

# draw a third sample and estimate again:
sample3 = stats.norm.rvs(10, 2, size=n)
estimate3 = np.mean(sample3)
print(f'estimate3: {estimate3}\n')


Script 1.52: Simulation-Repeated.py
import numpy as np
import scipy.stats as stats

# set the random seed:
np.random.seed(123456)

# set sample size:
n = 100

# initialize ybar to an array of length r=10000 to later store results:
r = 10000
ybar = np.empty(r)

# repeat r times:
for j in range(r):
    # draw a sample and store the sample mean in pos. j=0,1,... of ybar:
    sample = stats.norm.rvs(10, 2, size=n)
    ybar[j] = np.mean(sample)


Script 1.53: Simulation-Repeated-Results.py
import numpy as np
import statsmodels.api as sm
import scipy.stats as stats
import matplotlib.pyplot as plt

# set the random seed:
np.random.seed(123456)

# set sample size:
n = 100

# initialize ybar to an array of length r=10000 to later store results:
r = 10000
ybar = np.empty(r)

# repeat r times:
for j in range(r):
    # draw a sample and store the sample mean in pos. j=0,1,... of ybar:
    sample = stats.norm.rvs(10, 2, size=n)
    ybar[j] = np.mean(sample)

# the first 20 of 10000 estimates:
print(f'ybar[0:19]: \n{ybar[0:19]}\n')

# simulated mean:
print(f'np.mean(ybar): {np.mean(ybar)}\n')

# simulated variance:
print(f'np.var(ybar, ddof=1): {np.var(ybar, ddof=1)}\n')

# simulated density:
kde = sm.nonparametric.KDEUnivariate(ybar)
kde.fit()

# normal density:
x_range = np.linspace(9, 11)
y = stats.norm.pdf(x_range, 10, np.sqrt(0.04))

# create graph:
plt.plot(kde.support, kde.density, color='black', label='ybar')
plt.plot(x_range, y, linestyle='--', color='black', label='normal distribution')
plt.ylabel('density')
plt.xlabel('ybar')
plt.legend()
plt.savefig('PyGraphs/Simulation-Repeated-Results.pdf')
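
The simulated variance printed by the script above can be compared with theory: for a population standard deviation of 2 and n = 100, the sample mean has variance 4/100 = 0.04, which is the value passed to the normal density. The following is only a minimal sketch of that comparison using the Python standard library instead of numpy and scipy. The smaller number of replications and the names sim_mean, sim_var and theo_var are assumptions made for illustration.

```python
import random
import statistics

# illustrative settings (assumptions, not taken from the scripts above):
random.seed(123456)
n = 100    # sample size
r = 2000   # number of replications, reduced to keep the run short

# repeat r times: draw a sample from N(10, 2^2) and store its mean
ybar = []
for j in range(r):
    sample = [random.gauss(10, 2) for _ in range(n)]
    ybar.append(statistics.fmean(sample))

# simulated mean and variance of the sample averages:
sim_mean = statistics.fmean(ybar)
sim_var = statistics.variance(ybar)  # divides by r - 1, like ddof=1

# theoretical variance of the sample mean: sigma^2 / n
theo_var = 2 ** 2 / n

print(f'sim_mean: {sim_mean}\n')
print(f'sim_var: {sim_var}\n')
print(f'theo_var: {theo_var}\n')
```

With more replications, sim_var should move closer to theo_var.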


Script 1.54: Simulation-Inference-Figure.py
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# set the random seed:
np.random.seed(123456)

# set sample size and MC simulations:
r = 10000
n = 100

# initialize arrays to later store results:
CIlower = np.empty(r)
CIupper = np.empty(r)
pvalue1 = np.empty(r)
pvalue2 = np.empty(r)

# repeat r times:
for j in range(r):
    # draw a sample:
    sample = stats.norm.rvs(10, 2, size=n)
    sample_mean = np.mean(sample)
    sample_sd = np.std(sample, ddof=1)

    # test the (correct) null hypothesis mu=10:
    testres1 = stats.ttest_1samp(sample, popmean=10)
    pvalue1[j] = testres1.pvalue
    cv = stats.t.ppf(0.975, df=n - 1)
    CIlower[j] = sample_mean - cv * sample_sd / np.sqrt(n)
    CIupper[j] = sample_mean + cv * sample_sd / np.sqrt(n)

    # test the (incorrect) null hypothesis mu=9.5 & store the p value:
    testres2 = stats.ttest_1samp(sample, popmean=9.5)
    pvalue2[j] = testres2.pvalue

##################
## correct H0 ##
##################

plt.figure(figsize=(3, 5))  # set figure ratio
plt.ylim(0, 101)
plt.xlim(9, 11)
for j in range(1, 101):
    if 10 > CIlower[j] and 10 < CIupper[j]:
        plt.plot([CIlower[j], CIupper[j]], [j, j], linestyle='-', color='grey')
    else:
        plt.plot([CIlower[j], CIupper[j]], [j, j], linestyle='-', color='black')
plt.axvline(10, linestyle='--', color='black', linewidth=0.5)
plt.ylabel('Sample No.')
plt.savefig('PyGraphs/Simulation-Inference-Figure1.pdf')

##################
## incorrect H0 ##
##################

plt.figure(figsize=(3, 5))  # set figure ratio
plt.ylim(0, 101)
plt.xlim(9, 11)
for j in range(1, 101):
    if 9.5 > CIlower[j] and 9.5 < CIupper[j]:
        plt.plot([CIlower[j], CIupper[j]], [j, j], linestyle='-', color='grey')
    else:
        plt.plot([CIlower[j], CIupper[j]], [j, j], linestyle='-', color='black')
plt.axvline(9.5, linestyle='--', color='black', linewidth=0.5)
plt.ylabel('Sample No.')
plt.savefig('PyGraphs/Simulation-Inference-Figure2.pdf')


Script 1.55: Simulation-Inference.py
import numpy as np
import scipy.stats as stats

# set the random seed:
np.random.seed(123456)

# set sample size and MC simulations:
r = 10000
n = 100

# initialize arrays to later store results:
CIlower = np.empty(r)
CIupper = np.empty(r)
pvalue1 = np.empty(r)
pvalue2 = np.empty(r)

# repeat r times:
for j in range(r):
    # draw a sample:
    sample = stats.norm.rvs(10, 2, size=n)
    sample_mean = np.mean(sample)
    sample_sd = np.std(sample, ddof=1)

    # test the (correct) null hypothesis mu=10:
    testres1 = stats.ttest_1samp(sample, popmean=10)
    pvalue1[j] = testres1.pvalue
    cv = stats.t.ppf(0.975, df=n - 1)
    CIlower[j] = sample_mean - cv * sample_sd / np.sqrt(n)
    CIupper[j] = sample_mean + cv * sample_sd / np.sqrt(n)

    # test the (incorrect) null hypothesis mu=9.5 & store the p value:
    testres2 = stats.ttest_1samp(sample, popmean=9.5)
    pvalue2[j] = testres2.pvalue

# test results as logical value:
reject1 = pvalue1 <= 0.05
count1_true = np.count_nonzero(reject1)  # counts true
count1_false = r - count1_true
print(f'count1_true: {count1_true}\n')
print(f'count1_false: {count1_false}\n')

reject2 = pvalue2 <= 0.05
count2_true = np.count_nonzero(reject2)
count2_false = r - count2_true
print(f'count2_true: {count2_true}\n')
print(f'count2_false: {count2_false}\n')


2. Scripts Used in Chapter 02 


Script 2.1: Example-2-3.py
import wooldridge as woo
import numpy as np

ceosal1 = woo.dataWoo('ceosal1')
x = ceosal1['roe']
y = ceosal1['salary']

# ingredients to the OLS formulas:
cov_xy = np.cov(x, y)[1, 0]  # access 2. row and 1. column of covariance matrix
var_x = np.var(x, ddof=1)
x_bar = np.mean(x)
y_bar = np.mean(y)

# manual calculation of OLS coefficients:
b1 = cov_xy / var_x
b0 = y_bar - b1 * x_bar
print(f'b1: {b1}\n')
print(f'b0: {b0}\n')
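
The same two formulas, slope = Cov(x, y) / Var(x) and intercept = mean(y) - slope * mean(x), can be written without numpy. This is a minimal sketch only: the helper name ols_simple and the toy data are hypothetical and unrelated to the ceosal1 data set.

```python
def ols_simple(x, y):
    """Return (b0, b1) for the simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # sample covariance and variance, both with n - 1 in the denominator:
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    b1 = cov_xy / var_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


# hypothetical toy data (for illustration only):
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_simple(x, y)
print(f'b1: {b1}\n')
print(f'b0: {b0}\n')
```

Because the factor 1 / (n - 1) appears in both the covariance and the variance, it cancels in the ratio, so the choice of denominator does not affect the slope.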


Script 2.2: Example-2-3-2.py
import wooldridge as woo
import statsmodels.formula.api as smf

ceosal1 = woo.dataWoo('ceosal1')

reg = smf.ols(formula='salary ~ roe', data=ceosal1)
results = reg.fit()
b = results.params
print(f'b: \n{b}\n')


Script 2.3: Example-2-3-3.py
import wooldridge as woo
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

ceosal1 = woo.dataWoo('ceosal1')

# OLS regression:
reg = smf.ols(formula='salary ~ roe', data=ceosal1)
results = reg.fit()

# scatter plot and fitted values:
plt.plot('roe', 'salary', data=ceosal1, color='grey', marker='o', linestyle='')
plt.plot(ceosal1['roe'], results.fittedvalues, color='black', linestyle='-')
plt.ylabel('salary')
plt.xlabel('roe')
plt.savefig('PyGraphs/Example-2-3-3.pdf')


Script 2.4: Example-2-4.py
import wooldridge as woo
import statsmodels.formula.api as smf

wage1 = woo.dataWoo('wage1')

reg = smf.ols(formula='wage ~ educ', data=wage1)
results = reg.fit()
b = results.params
print(f'b: \n{b}\n')


Script 2.5: Example-2-5.py
import wooldridge as woo
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

vote1 = woo.dataWoo('vote1')

# OLS regression:
reg = smf.ols(formula='voteA ~ shareA', data=vote1)
results = reg.fit()
b = results.params
print(f'b: \n{b}\n')

# scatter plot and fitted values:
plt.plot('shareA', 'voteA', data=vote1, color='grey', marker='o', linestyle='')
plt.plot(vote1['shareA'], results.fittedvalues, color='black', linestyle='-')
plt.ylabel('voteA')
plt.xlabel('shareA')
plt.savefig('PyGraphs/Example-2-5.pdf')


Script 2.6: Example-2-6.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

ceosal1 = woo.dataWoo('ceosal1')

# OLS regression:
reg = smf.ols(formula='salary ~ roe', data=ceosal1)
results = reg.fit()

# obtain predicted values and residuals:
salary_hat = results.fittedvalues
u_hat = results.resid

# Wooldridge, Table 2.2:
table = pd.DataFrame({'roe': ceosal1['roe'],
                      'salary': ceosal1['salary'],
                      'salary_hat': salary_hat,
                      'u_hat': u_hat})
print(f'table.head(15): \n{table.head(15)}\n')


Script 2.7: Example-2-7.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

wage1 = woo.dataWoo('wage1')
reg = smf.ols(formula='wage ~ educ', data=wage1)
results = reg.fit()

# obtain coefficients, predicted values and residuals:
b = results.params
wage_hat = results.fittedvalues
u_hat = results.resid

# confirm property (1):
u_hat_mean = np.mean(u_hat)
print(f'u_hat_mean: {u_hat_mean}\n')

# confirm property (2):
educ_u_cov = np.cov(wage1['educ'], u_hat)[1, 0]
print(f'educ_u_cov: {educ_u_cov}\n')

# confirm property (3):
educ_mean = np.mean(wage1['educ'])
wage_pred = b[0] + b[1] * educ_mean
print(f'wage_pred: {wage_pred}\n')

wage_mean = np.mean(wage1['wage'])
print(f'wage_mean: {wage_mean}\n')


Script 2.8: Example-2-8.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

ceosal1 = woo.dataWoo('ceosal1')

# OLS regression:
reg = smf.ols(formula='salary ~ roe', data=ceosal1)
results = reg.fit()

# calculate predicted values & residuals:
sal_hat = results.fittedvalues
u_hat = results.resid

# calculate R^2 in three different ways:
sal = ceosal1['salary']
R2_a = np.var(sal_hat, ddof=1) / np.var(sal, ddof=1)
R2_b = 1 - np.var(u_hat, ddof=1) / np.var(sal, ddof=1)
R2_c = np.corrcoef(sal, sal_hat)[1, 0] ** 2

print(f'R2_a: {R2_a}\n')
print(f'R2_b: {R2_b}\n')
print(f'R2_c: {R2_c}\n')
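
The three expressions agree because OLS with an intercept makes the residuals uncorrelated with the fitted values, so Var(y) = Var(y_hat) + Var(u_hat). The sketch below illustrates this with the standard library only. The function name r_squared_three_ways and the toy data are hypothetical and do not come from the ceosal1 data set.

```python
import math


def sample_var(v):
    """Sample variance with n - 1 in the denominator."""
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)


def sample_corr(a, b):
    """Sample correlation coefficient of two equally long lists."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)
    return cov / math.sqrt(sample_var(a) * sample_var(b))


def r_squared_three_ways(x, y):
    """Fit y = b0 + b1 * x by OLS and return R^2 computed in three ways."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    u_hat = [yi - yh for yi, yh in zip(y, y_hat)]
    R2_a = sample_var(y_hat) / sample_var(y)
    R2_b = 1 - sample_var(u_hat) / sample_var(y)
    R2_c = sample_corr(y, y_hat) ** 2
    return R2_a, R2_b, R2_c


# hypothetical toy data (for illustration only):
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
R2_a, R2_b, R2_c = r_squared_three_ways(x, y)
print(f'R2_a: {R2_a}\n')
print(f'R2_b: {R2_b}\n')
print(f'R2_c: {R2_c}\n')
```

The equality relies on the intercept: in a regression through the origin the three numbers generally differ.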


Script 2.9: Example-2-9.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

vote1 = woo.dataWoo('vote1')

# OLS regression:
reg = smf.ols(formula='voteA ~ shareA', data=vote1)
results = reg.fit()

# print results using summary:
print(f'results.summary(): \n{results.summary()}\n')

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Script 2.10: Example-2-10.py
import numpy as np
import wooldridge as woo
import statsmodels.formula.api as smf

wage1 = woo.dataWoo('wage1')

# estimate log-level model:
reg = smf.ols(formula='np.log(wage) ~ educ', data=wage1)
results = reg.fit()
b = results.params
print(f'b: \n{b}\n')


Script 2.11: Example-2-11.py
import numpy as np
import wooldridge as woo
import statsmodels.formula.api as smf

ceosal1 = woo.dataWoo('ceosal1')

# estimate log-log model:
reg = smf.ols(formula='np.log(salary) ~ np.log(sales)', data=ceosal1)
results = reg.fit()
b = results.params
print(f'b: \n{b}\n')


Script 2.12: SLR-Origin-Const.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

ceosal1 = woo.dataWoo('ceosal1')

# usual OLS regression:
reg1 = smf.ols(formula='salary ~ roe', data=ceosal1)
results1 = reg1.fit()
b_1 = results1.params
print(f'b_1: \n{b_1}\n')

# regression without intercept (through origin):
reg2 = smf.ols(formula='salary ~ 0 + roe', data=ceosal1)
results2 = reg2.fit()
b_2 = results2.params
print(f'b_2: \n{b_2}\n')

# regression without slope (on a constant):
reg3 = smf.ols(formula='salary ~ 1', data=ceosal1)
results3 = reg3.fit()
b_3 = results3.params
print(f'b_3: \n{b_3}\n')

# average y:
sal_mean = np.mean(ceosal1['salary'])
print(f'sal_mean: {sal_mean}\n')

# scatter plot and fitted values:
plt.plot('roe', 'salary', data=ceosal1, color='grey', marker='o',
         linestyle='', label='')
plt.plot(ceosal1['roe'], results1.fittedvalues, color='black',
         linestyle='-', label='full')
plt.plot(ceosal1['roe'], results2.fittedvalues, color='black',
         linestyle=':', label='through origin')
plt.plot(ceosal1['roe'], results3.fittedvalues, color='black',
         linestyle='-.', label='const only')
plt.ylabel('salary')
plt.xlabel('roe')
plt.legend()
plt.savefig('PyGraphs/SLR-Origin-Const.pdf')


Script 2.13: Example-2-12.py
import numpy as np
import wooldridge as woo
import statsmodels.formula.api as smf

meap93 = woo.dataWoo('meap93')

# estimate the model and save the results as "results":
reg = smf.ols(formula='math10 ~ lnchprg', data=meap93)
results = reg.fit()

# number of obs.:
n = results.nobs

# SER:
u_hat_var = np.var(results.resid, ddof=1)
SER = np.sqrt(u_hat_var) * np.sqrt((n - 1) / (n - 2))
print(f'SER: {SER}\n')

# SE of b0 & b1, respectively:
lnchprg_sq_mean = np.mean(meap93['lnchprg'] ** 2)
lnchprg_var = np.var(meap93['lnchprg'], ddof=1)
b0_se = SER / (np.sqrt(lnchprg_var)
               * np.sqrt(n - 1)) * np.sqrt(lnchprg_sq_mean)
b1_se = SER / (np.sqrt(lnchprg_var) * np.sqrt(n - 1))
print(f'b0_se: {b0_se}\n')
print(f'b1_se: {b1_se}\n')

# automatic calculations:
print(f'results.summary(): \n{results.summary()}\n')


Script 2.14: SLR-Sim-Sample.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import scipy.stats as stats
import matplotlib.pyplot as plt

# set the random seed:
np.random.seed(1234567)

# set sample size:
n = 1000

# set true parameters (betas and sd of u):
beta0 = 1
beta1 = 0.5
su = 2

# draw a sample of size n:
x = stats.norm.rvs(4, 1, size=n)
u = stats.norm.rvs(0, su, size=n)
y = beta0 + beta1 * x + u
df = pd.DataFrame({'y': y, 'x': x})

# estimate parameters by OLS:
reg = smf.ols(formula='y ~ x', data=df)
results = reg.fit()
b = results.params
print(f'b: \n{b}\n')

# features of the sample for the variance formula:
x_sq_mean = np.mean(x ** 2)
print(f'x_sq_mean: {x_sq_mean}\n')
x_var = np.sum((x - np.mean(x)) ** 2)
print(f'x_var: {x_var}\n')

# graph:
x_range = np.linspace(0, 8, num=100)
plt.ylim([-2, 10])
plt.plot(x, y, color='lightgrey', marker='o', linestyle='')
plt.plot(x_range, beta0 + beta1 * x_range, color='black',
         linestyle='-', linewidth=2, label='pop. regr. fct.')
plt.plot(x_range, b[0] + b[1] * x_range, color='grey',
         linestyle='-', linewidth=2, label='OLS regr. fct.')
plt.ylabel('y')
plt.xlabel('x')
plt.legend()
plt.savefig('PyGraphs/SLR-Sim-Sample.pdf')


Script 2.15: SLR-Sim-Model.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import scipy.stats as stats

# set the random seed:
np.random.seed(1234567)

# set sample size and number of simulations:
n = 1000
r = 10000

# set true parameters:
beta0 = 1
beta1 = 0.5
su = 2
sx = 1
ex = 4

# initialize b0 and b1 to store results later:
b0 = np.empty(r)
b1 = np.empty(r)

# repeat r times:
for i in range(r):
    # draw a sample:
    x = stats.norm.rvs(ex, sx, size=n)
    u = stats.norm.rvs(0, su, size=n)
    y = beta0 + beta1 * x + u
    df = pd.DataFrame({'y': y, 'x': x})

    # estimate OLS:
    reg = smf.ols(formula='y ~ x', data=df)
    results = reg.fit()
    b0[i] = results.params['Intercept']
    b1[i] = results.params['x']


Script 2.16: SLR-Sim-Model-Condx.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import scipy.stats as stats
import matplotlib.pyplot as plt

# set the random seed:
np.random.seed(1234567)

# set sample size and number of simulations:
n = 1000
r = 10000

# set true parameters (betas and sd of u):
beta0 = 1
beta1 = 0.5
su = 2

# initialize b0 and b1 to store results later:
b0 = np.empty(r)
b1 = np.empty(r)

# draw a sample of x, fixed over replications:
x = stats.norm.rvs(4, 1, size=n)

# repeat r times:
for i in range(r):
    # draw a sample of y:
    u = stats.norm.rvs(0, su, size=n)
    y = beta0 + beta1 * x + u
    df = pd.DataFrame({'y': y, 'x': x})

    # estimate and store parameters by OLS:
    reg = smf.ols(formula='y ~ x', data=df)
    results = reg.fit()
    b0[i] = results.params['Intercept']
    b1[i] = results.params['x']

# MC estimate of the expected values:
b0_mean = np.mean(b0)
b1_mean = np.mean(b1)

print(f'b0_mean: {b0_mean}\n')
print(f'b1_mean: {b1_mean}\n')

# MC estimate of the variances:
b0_var = np.var(b0, ddof=1)
b1_var = np.var(b1, ddof=1)
print(f'b0_var: {b0_var}\n')
print(f'b1_var: {b1_var}\n')

# graph:
x_range = np.linspace(0, 8, num=100)
plt.ylim([0, 6])

# add population regression line:
plt.plot(x_range, beta0 + beta1 * x_range, color='black',
         linestyle='-', linewidth=2, label='Population')

# add first OLS regression line (to attach a label):
plt.plot(x_range, b0[0] + b1[0] * x_range, color='grey',
         linestyle='-', linewidth=0.5, label='OLS regressions')

# add OLS regression lines no. 2 to 10:
for i in range(1, 10):
    plt.plot(x_range, b0[i] + b1[i] * x_range, color='grey',
             linestyle='-', linewidth=0.5)
plt.ylabel('y')
plt.xlabel('x')
plt.legend()
plt.savefig('PyGraphs/SLR-Sim-Model-Condx.pdf')


Script 2.17: SLR-Sim-Model-ViolSLR4.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import scipy.stats as stats

# set the random seed:
np.random.seed(1234567)

# set sample size and number of simulations:
n = 1000
r = 10000

# set true parameters (betas and sd of u):
beta0 = 1
beta1 = 0.5
su = 2

# initialize b0 and b1 to store results later:
b0 = np.empty(r)
b1 = np.empty(r)

# draw a sample of x, fixed over replications:
x = stats.norm.rvs(4, 1, size=n)

# repeat r times:
for i in range(r):
    # draw a sample of y:
    u_mean = np.array((x - 4) / 5)
    u = stats.norm.rvs(u_mean, su, size=n)
    y = beta0 + beta1 * x + u
    df = pd.DataFrame({'y': y, 'x': x})

    # estimate and store parameters by OLS:
    reg = smf.ols(formula='y ~ x', data=df)
    results = reg.fit()
    b0[i] = results.params['Intercept']
    b1[i] = results.params['x']

# MC estimate of the expected values:
b0_mean = np.mean(b0)
b1_mean = np.mean(b1)

print(f'b0_mean: {b0_mean}\n')
print(f'b1_mean: {b1_mean}\n')

# MC estimate of the variances:
b0_var = np.var(b0, ddof=1)
b1_var = np.var(b1, ddof=1)

print(f'b0_var: {b0_var}\n')
print(f'b1_var: {b1_var}\n')


Script 2.18: SLR-Sim-Model-ViolSLR5.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import scipy.stats as stats

# set the random seed:
np.random.seed(1234567)

# set sample size and number of simulations:
n = 1000
r = 10000

# set true parameters (betas):
beta0 = 1
beta1 = 0.5

# initialize b0 and b1 to store results later:
b0 = np.empty(r)
b1 = np.empty(r)

# draw a sample of x, fixed over replications:
x = stats.norm.rvs(4, 1, size=n)

# repeat r times:
for i in range(r):
    # draw a sample of y:
    u_var = np.array(4 / np.exp(4.5) * np.exp(x))
    u = stats.norm.rvs(0, np.sqrt(u_var), size=n)
    y = beta0 + beta1 * x + u
    df = pd.DataFrame({'y': y, 'x': x})

    # estimate and store parameters by OLS:
    reg = smf.ols(formula='y ~ x', data=df)
    results = reg.fit()
    b0[i] = results.params['Intercept']
    b1[i] = results.params['x']

# MC estimate of the expected values:
b0_mean = np.mean(b0)
b1_mean = np.mean(b1)

print(f'b0_mean: {b0_mean}\n')
print(f'b1_mean: {b1_mean}\n')

# MC estimate of the variances:
b0_var = np.var(b0, ddof=1)
b1_var = np.var(b1, ddof=1)

print(f'b0_var: {b0_var}\n')
print(f'b1_var: {b1_var}\n')


3. Scripts Used in Chapter 03


Script 3.1: Example-3-1.py
import wooldridge as woo
import statsmodels.formula.api as smf

gpa1 = woo.dataWoo('gpa1')

reg = smf.ols(formula='colGPA ~ hsGPA + ACT', data=gpa1)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')


Script 3.2: Example-3-2.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

wage1 = woo.dataWoo('wage1')

reg = smf.ols(formula='np.log(wage) ~ educ + exper + tenure', data=wage1)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')


Script 3.3: Example-3-3.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

k401k = woo.dataWoo('401k')

reg = smf.ols(formula='prate ~ mrate + age', data=k401k)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')


Script 3.4: Example-3-5a.py
import wooldridge as woo
import statsmodels.formula.api as smf

crime1 = woo.dataWoo('crime1')

# model without avgsen:
reg = smf.ols(formula='narr86 ~ pcnv + ptime86 + qemp86', data=crime1)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')


Script 3.5: Example-3-5b.py
import wooldridge as woo
import statsmodels.formula.api as smf

crime1 = woo.dataWoo('crime1')

# model with avgsen:
reg = smf.ols(formula='narr86 ~ pcnv + avgsen + ptime86 + qemp86', data=crime1)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')


Script 3.6: Example-3-6.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

wage1 = woo.dataWoo('wage1')

reg = smf.ols(formula='np.log(wage) ~ educ', data=wage1)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')


Script 3.7: OLS-Matrices.py
import wooldridge as woo
import numpy as np
import pandas as pd
import patsy as pt

gpa1 = woo.dataWoo('gpa1')

# determine sample size & no. of regressors:
n = len(gpa1)
k = 2

# extract y:
y = gpa1['colGPA']

# extract X & add a column of ones:
X = pd.DataFrame({'const': 1, 'hsGPA': gpa1['hsGPA'], 'ACT': gpa1['ACT']})

# alternative with patsy:
y2, X2 = pt.dmatrices('colGPA ~ hsGPA + ACT', data=gpa1, return_type='dataframe')

# display first rows of X:
print(f'X.head(): \n{X.head()}\n')

# parameter estimates:
X = np.array(X)
y = np.array(y).reshape(n, 1)  # creates a column vector
b = np.linalg.inv(X.T @ X) @ X.T @ y
print(f'b: \n{b}\n')

# residuals, estimated variance of u and SER:
u_hat = y - X @ b
sigsq_hat = (u_hat.T @ u_hat) / (n - k - 1)
SER = np.sqrt(sigsq_hat)
print(f'SER: {SER}\n')

# estimated variance of the parameter estimators and SE:
Vbeta_hat = sigsq_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diagonal(Vbeta_hat))
print(f'se: {se}\n')


Script 3.8: Omitted-Vars.py
import wooldridge as woo
import statsmodels.formula.api as smf

gpa1 = woo.dataWoo('gpa1')

# parameter estimates for full and simple model:
reg = smf.ols(formula='colGPA ~ ACT + hsGPA', data=gpa1)
results = reg.fit()
b = results.params
print(f'b: \n{b}\n')

# relation between regressors:
reg_delta = smf.ols(formula='hsGPA ~ ACT', data=gpa1)
results_delta = reg_delta.fit()
delta_tilde = results_delta.params
print(f'delta_tilde: \n{delta_tilde}\n')

# omitted variables formula for b1_tilde:
b1_tilde = b['ACT'] + b['hsGPA'] * delta_tilde['ACT']
print(f'b1_tilde: \n{b1_tilde}\n')

# actual regression with hsGPA omitted:
reg_om = smf.ols(formula='colGPA ~ ACT', data=gpa1)
results_om = reg_om.fit()
b_om = results_om.params
print(f'b_om: \n{b_om}\n')


Script 3.9: MLR-SE.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

gpa1 = woo.dataWoo('gpa1')

# full estimation results including automatic SE:
reg = smf.ols(formula='colGPA ~ hsGPA + ACT', data=gpa1)
results = reg.fit()

# extract SER (instead of calculation via residuals):
SER = np.sqrt(results.mse_resid)

# regressing hsGPA on ACT for calculation of R2 & VIF:
reg_hsGPA = smf.ols(formula='hsGPA ~ ACT', data=gpa1)
results_hsGPA = reg_hsGPA.fit()
R2_hsGPA = results_hsGPA.rsquared
VIF_hsGPA = 1 / (1 - R2_hsGPA)
print(f'VIF_hsGPA: {VIF_hsGPA}\n')

# manual calculation of SE of hsGPA coefficient:
n = results.nobs
sdx = np.std(gpa1['hsGPA'], ddof=1) * np.sqrt((n - 1) / n)
SE_hsGPA = 1 / np.sqrt(n) * SER / sdx * np.sqrt(VIF_hsGPA)
print(f'SE_hsGPA: {SE_hsGPA}\n')


Script 3.10: MLR-VIF.py
import wooldridge as woo
import numpy as np
import statsmodels.stats.outliers_influence as smo
import patsy as pt

wage1 = woo.dataWoo('wage1')

# extract matrices using patsy:
y, X = pt.dmatrices('np.log(wage) ~ educ + exper + tenure',
                    data=wage1, return_type='dataframe')

# get VIF:
K = X.shape[1]
VIF = np.empty(K)
for i in range(K):
    VIF[i] = smo.variance_inflation_factor(X.values, i)
print(f'VIF: \n{VIF}\n')


4. Scripts Used in Chapter 04 


Script 4.1: Example-4-3-cv.py
import scipy.stats as stats
import numpy as np

# CV for alpha=5% and 1% using the t distribution with 137 d.f.:
alpha = np.array([0.05, 0.01])
cv_t = stats.t.ppf(1 - alpha / 2, 137)
print(f'cv_t: {cv_t}\n')

# CV for alpha=5% and 1% using the normal approximation:
cv_n = stats.norm.ppf(1 - alpha / 2)
print(f'cv_n: {cv_n}\n')


Script 4.2: Example-4-3.py
import wooldridge as woo
import statsmodels.formula.api as smf
import scipy.stats as stats

gpa1 = woo.dataWoo('gpa1')

# store and display results:
reg = smf.ols(formula='colGPA ~ hsGPA + ACT + skipped', data=gpa1)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')

# manually confirm the formulas, i.e. extract coefficients and SE:
b = results.params
se = results.bse

# reproduce t statistic:
tstat = b / se
print(f'tstat: \n{tstat}\n')

# reproduce p value:
pval = 2 * stats.t.cdf(-abs(tstat), 137)
print(f'pval: \n{pval}\n')


Script 4.3: Example-4-1-cv.py
import scipy.stats as stats
import numpy as np

# CV for alpha=5% and 1% using the t distribution with 522 d.f.:
alpha = np.array([0.05, 0.01])
cv_t = stats.t.ppf(1 - alpha, 522)
print(f'cv_t: {cv_t}\n')

# CV for alpha=5% and 1% using the normal approximation:
cv_n = stats.norm.ppf(1 - alpha)
print(f'cv_n: {cv_n}\n')




Script 4.4: Example-4-1.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

wage1 = woo.dataWoo('wage1')

reg = smf.ols(formula='np.log(wage) ~ educ + exper + tenure', data=wage1)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')


Script 4.5: Example-4-8.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

rdchem = woo.dataWoo('rdchem')

# OLS regression:
reg = smf.ols(formula='np.log(rd) ~ np.log(sales) + profmarg', data=rdchem)
results = reg.fit()
print(f'results.summary(): \n{results.summary()}\n')

# 95% CI:
CI95 = results.conf_int(0.05)
print(f'CI95: \n{CI95}\n')

# 99% CI:
CI99 = results.conf_int(0.01)
print(f'CI99: \n{CI99}\n')


Script 4.6: F-Test.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf
import scipy.stats as stats

mlb1 = woo.dataWoo('mlb1')
n = mlb1.shape[0]

# unrestricted OLS regression:
reg_ur = smf.ols(
    formula='np.log(salary) ~ years + gamesyr + bavg + hrunsyr + rbisyr',
    data=mlb1)
fit_ur = reg_ur.fit()
r2_ur = fit_ur.rsquared
print(f'r2_ur: {r2_ur}\n')

# restricted OLS regression:
reg_r = smf.ols(formula='np.log(salary) ~ years + gamesyr', data=mlb1)
fit_r = reg_r.fit()
r2_r = fit_r.rsquared
print(f'r2_r: {r2_r}\n')

# F statistic:
fstat = (r2_ur - r2_r) / (1 - r2_ur) * (n - 6) / 3
print(f'fstat: {fstat}\n')

# CV for alpha=1% using the F distribution with 3 and 347 d.f.:
cv = stats.f.ppf(1 - 0.01, 3, 347)
print(f'cv: {cv}\n')

# p value = 1-cdf of the appropriate F distribution:
fpval = 1 - stats.f.cdf(fstat, 3, 347)
print(f'fpval: {fpval}\n')
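
The constants in the F statistic above follow from the general R-squared form F = (R2_ur - R2_r) / (1 - R2_ur) * (n - k - 1) / q, with k = 5 regressors in the unrestricted model and q = 3 restrictions. The sketch below wraps this in a small function. The name f_stat_from_r2 and the two R-squared values are hypothetical and chosen only for illustration, while n = 353 is implied by the 347 denominator degrees of freedom used in the script.

```python
def f_stat_from_r2(r2_ur, r2_r, n, k, q):
    """F statistic in R-squared form.

    r2_ur: R-squared of the unrestricted model
    r2_r:  R-squared of the restricted model
    n:     number of observations
    k:     number of regressors in the unrestricted model (without constant)
    q:     number of restrictions tested
    """
    return (r2_ur - r2_r) / (1 - r2_ur) * (n - k - 1) / q


# hypothetical R-squared values (not the estimates from the mlb1 data):
fstat = f_stat_from_r2(r2_ur=0.6, r2_r=0.5, n=353, k=5, q=3)
print(f'fstat: {fstat}\n')
```

The numerator and denominator degrees of freedom to pass to the F distribution are q and n - k - 1, respectively.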


Script 4.7: F-Test-Automatic.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

mlb1 = woo.dataWoo('mlb1')

# OLS regression:
reg = smf.ols(
    formula='np.log(salary) ~ years + gamesyr + bavg + hrunsyr + rbisyr',
    data=mlb1)
results = reg.fit()

# automated F test:
hypotheses = ['bavg = 0', 'hrunsyr = 0', 'rbisyr = 0']
ftest = results.f_test(hypotheses)
fstat = ftest.statistic[0][0]
fpval = ftest.pvalue

print(f'fstat: {fstat}\n')
print(f'fpval: {fpval}\n')


Script 4.8: F-Test-Automatic2.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

mlb1 = woo.dataWoo('mlb1')

# OLS regression:
reg = smf.ols(
    formula='np.log(salary) ~ years + gamesyr + bavg + hrunsyr + rbisyr',
    data=mlb1)
results = reg.fit()

# automated F test:
hypotheses = ['bavg = 0', 'hrunsyr = 2*rbisyr']
ftest = results.f_test(hypotheses)
fstat = ftest.statistic[0][0]
fpval = ftest.pvalue

print(f'fstat: {fstat}\n')
print(f'fpval: {fpval}\n')


5. Scripts Used in Chapter 05 




Script 5.1: Sim-Asy-OLS-norm.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import scipy.stats as stats

# set the random seed:
np.random.seed(1234567)

# set sample size and number of simulations:
n = 100
r = 10000

# set true parameters:
beta0 = 1
beta1 = 0.5
sx = 1
ex = 4

# initialize b1 to store results later:
b1 = np.empty(r)

# draw a sample of x, fixed over replications:
x = stats.norm.rvs(ex, sx, size=n)

# repeat r times:
for i in range(r):
    # draw a sample of u (std. normal):
    u = stats.norm.rvs(0, 1, size=n)
    y = beta0 + beta1 * x + u
    df = pd.DataFrame({'y': y, 'x': x})

    # estimate conditional OLS:
    reg = smf.ols(formula='y ~ x', data=df)
    results = reg.fit()
    b1[i] = results.params['x']


Script 5.2: Sim-Asy-OLS-chisq.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import scipy.stats as stats

# set the random seed:
np.random.seed(1234567)

# set sample size and number of simulations:
n = 100
r = 10000

# set true parameters:
beta0 = 1
beta1 = 0.5
sx = 1
ex = 4

# initialize b1 to store results later:
b1 = np.empty(r)

# draw a sample of x, fixed over replications:
x = stats.norm.rvs(ex, sx, size=n)

# repeat r times:
for i in range(r):
    # draw a sample of u (standardized chi-squared[1]):
    u = (stats.chi2.rvs(1, size=n) - 1) / np.sqrt(2)
    y = beta0 + beta1 * x + u
    df = pd.DataFrame({'y': y, 'x': x})

    # estimate conditional OLS:
    reg = smf.ols(formula='y ~ x', data=df)
    results = reg.fit()
    b1[i] = results.params['x']


Script 5.3: Sim-Asy-OLS-uncond.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import scipy.stats as stats

# set the random seed:
np.random.seed(1234567)

# set sample size and number of simulations:
n = 100
r = 10000

# set true parameters:
beta0 = 1
beta1 = 0.5
sx = 1
ex = 4

# initialize b1 to store results later:
b1 = np.empty(r)

# repeat r times:
for i in range(r):
    # draw a sample of x, varying over replications:
    x = stats.norm.rvs(ex, sx, size=n)

    # draw a sample of u (std. normal):
    u = stats.norm.rvs(0, 1, size=n)
    y = beta0 + beta1 * x + u
    df = pd.DataFrame({'y': y, 'x': x})

    # estimate unconditional OLS:
    reg = smf.ols(formula='y ~ x', data=df)
    results = reg.fit()
    b1[i] = results.params['x']


Script 5.4: Example-5-3.py 


import wooldridge as woo 
import statsmodels.formula.api as smf 
import scipy.stats as stats 


crime1 = woo.dataWoo('crime1')

# 1. estimate restricted model:
reg_r = smf.ols(formula='narr86 ~ pcnv + ptime86 + qemp86', data=crime1)
fit_r = reg_r.fit()
r2_r = fit_r.rsquared
print(f'r2_r: {r2_r}\n')

# 2. regression of residuals from restricted model:
crime1['utilde'] = fit_r.resid
reg_LM = smf.ols(formula='utilde ~ pcnv + ptime86 + qemp86 + avgsen + tottime',
                 data=crime1)
fit_LM = reg_LM.fit()
r2_LM = fit_LM.rsquared
print(f'r2_LM: {r2_LM}\n')

# 3. calculation of LM test statistic:
LM = r2_LM * fit_LM.nobs
print(f'LM: {LM}\n')

# 4. critical value from chi-squared distribution, alpha=10%:
cv = stats.chi2.ppf(1 - 0.10, 2)
print(f'cv: {cv}\n')

# 5. p value (alternative to critical value):
pval = 1 - stats.chi2.cdf(LM, 2)
print(f'pval: {pval}\n')

# 6. compare to F-test:
reg = smf.ols(formula='narr86 ~ pcnv + ptime86 + qemp86 + avgsen + tottime',
              data=crime1)
results = reg.fit()
hypotheses = ['avgsen = 0', 'tottime = 0']
ftest = results.f_test(hypotheses)
fstat = ftest.statistic[0][0]
fpval = ftest.pvalue
print(f'fstat: {fstat}\n')
print(f'fpval: {fpval}\n')
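Steps 3 to 5 of the LM test are plain arithmetic once the auxiliary R-squared and the number of observations are known. As a hedged sketch using only the standard library: with two restrictions the chi-squared distribution has the survival function exp(-x/2), so both the critical value and the p value have closed forms. The input values r2_LM and nobs below are illustrative assumptions, not output taken from the script.

```python
import math

# illustrative inputs (assumptions, not results produced by the script):
r2_LM = 0.0015   # R-squared of the auxiliary regression
nobs = 2725      # number of observations

# LM test statistic:
LM = nobs * r2_LM

# chi-squared with 2 degrees of freedom: P(X > x) = exp(-x / 2),
# so the 10% critical value solves exp(-cv / 2) = 0.10:
alpha = 0.10
cv = -2 * math.log(alpha)

# p value from the same survival function:
pval = math.exp(-LM / 2)

print(f'LM: {LM}\n')
print(f'cv: {cv}\n')
print(f'pval: {pval}\n')
```

With these inputs the statistic stays below the critical value, so the null hypothesis would not be rejected at the 10% level. For any other number of restrictions the closed form no longer applies and stats.chi2 is needed, as in the script.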


6. Scripts Used in Chapter 06 


Script 6.1: Data-Scaling.py 
import wooldridge as woo 
import pandas as pd 
import statsmodels.formula.api as smf 


bwght = woo.dataWoo('bwght')

# regress and report coefficients:
reg = smf.ols(formula='bwght ~ cigs + faminc', data=bwght)
results = reg.fit()

# weight in pounds, manual way:
bwght['bwght_lbs'] = bwght['bwght'] / 16
reg_lbs = smf.ols(formula='bwght_lbs ~ cigs + faminc', data=bwght)
results_lbs = reg_lbs.fit()

# weight in pounds, direct way:
reg_lbs2 = smf.ols(formula='I(bwght/16) ~ cigs + faminc', data=bwght)
results_lbs2 = reg_lbs2.fit()

# packs of cigarettes:
reg_packs = smf.ols(formula='bwght ~ I(cigs/20) + faminc', data=bwght)
results_packs = reg_packs.fit()

# compare results:
table = pd.DataFrame({'b': round(results.params, 4),
                      'b_lbs': round(results_lbs.params, 4),
                      'b_lbs2': round(results_lbs2.params, 4),
                      'b_packs': round(results_packs.params, 4)})
print(f'table: \n{table}\n')


Script 6.2: Example-6-1.py
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf


# define a function for the standardization:
def scale(x):
    x_mean = np.mean(x)
    x_var = np.var(x, ddof=1)
    x_scaled = (x - x_mean) / np.sqrt(x_var)
    return x_scaled


# standardize and estimate:
hprice2 = woo.dataWoo('hprice2')
hprice2['price_sc'] = scale(hprice2['price'])
hprice2['nox_sc'] = scale(hprice2['nox'])
hprice2['crime_sc'] = scale(hprice2['crime'])
hprice2['rooms_sc'] = scale(hprice2['rooms'])
hprice2['dist_sc'] = scale(hprice2['dist'])
hprice2['stratio_sc'] = scale(hprice2['stratio'])

reg = smf.ols(
    formula='price_sc ~ 0 + nox_sc + crime_sc + rooms_sc + dist_sc + stratio_sc',
    data=hprice2)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')
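The standardization used above subtracts the sample mean and divides by the sample standard deviation (with n - 1 in the denominator, matching ddof=1). A small sketch with only the standard library shows that the result has mean zero and standard deviation one, which is also why the regression formula can drop the intercept. The input numbers are made up for illustration.

```python
import statistics


def scale(x):
    """Standardize a list of numbers: (x - mean) / sample std. deviation."""
    x_mean = statistics.fmean(x)
    x_sd = statistics.stdev(x)   # divides by n - 1, like ddof=1
    return [(xi - x_mean) / x_sd for xi in x]


# made-up example values (not taken from any data set):
values = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
scaled = scale(values)

scaled_mean = statistics.fmean(scaled)
scaled_sd = statistics.stdev(scaled)
print(f'scaled_mean: {scaled_mean}\n')
print(f'scaled_sd: {scaled_sd}\n')
```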


Script 6.3: Formula-Logarithm.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


hprice2 = woo.dataWoo('hprice2')

reg = smf.ols(formula='np.log(price) ~ np.log(nox) + rooms', data=hprice2)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Script 6.4: Example-6-2.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


hprice2 = woo.dataWoo('hprice2')

reg = smf.ols(
    formula='np.log(price) ~ np.log(nox)+np.log(dist)+rooms+I(rooms**2)+stratio',
    data=hprice2)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Script 6.5: Example-6-2-Ftest.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf


hprice2 = woo.dataWoo('hprice2')
n = hprice2.shape[0]

reg = smf.ols(
    formula='np.log(price) ~ np.log(nox)+np.log(dist)+rooms+I(rooms**2)+stratio',
    data=hprice2)
results = reg.fit()

# implemented F test for rooms:
hypotheses = ['rooms = 0', 'I(rooms ** 2) = 0']
ftest = results.f_test(hypotheses)
fstat = ftest.statistic[0][0]
fpval = ftest.pvalue

print(f'fstat: {fstat}\n')
print(f'fpval: {fpval}\n')


Script 6.6: Example-6-3.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


attend = woo.dataWoo('attend')
n = attend.shape[0]

reg = smf.ols(formula='stndfnl ~ atndrte*priGPA + ACT + I(priGPA**2) + I(ACT**2)',
              data=attend)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')

# estimate for partial effect at priGPA=2.59:
b = results.params
partial_effect = b['atndrte'] + 2.59 * b['atndrte:priGPA']
print(f'partial_effect: {partial_effect}\n')

# F test for partial effect at priGPA=2.59:
hypotheses = 'atndrte + 2.59 * atndrte:priGPA = 0'
ftest = results.f_test(hypotheses)
fstat = ftest.statistic[0][0]
fpval = ftest.pvalue

print(f'fstat: {fstat}\n')
print(f'fpval: {fpval}\n')


Script 6.7: Predictions.py 
import wooldridge as woo 
import statsmodels.formula.api as smf 
import pandas as pd 


gpa2 = woo.dataWoo('gpa2')

reg = smf.ols(formula='colgpa ~ sat + hsperc + hsize + I(hsize**2)', data=gpa2)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')

# generate data set containing the regressor values for predictions:
cvalues1 = pd.DataFrame({'sat': [1200], 'hsperc': [30],
                         'hsize': [5]}, index=['newPerson1'])
print(f'cvalues1: \n{cvalues1}\n')

# point estimate of prediction (cvalues1):
colgpa_pred1 = results.predict(cvalues1)
print(f'colgpa_pred1: \n{colgpa_pred1}\n')

# define three sets of regressor variables:
cvalues2 = pd.DataFrame({'sat': [1200, 900, 1400, ],
                         'hsperc': [30, 20, 5], 'hsize': [5, 3, 1]},
                        index=['newPerson1', 'newPerson2', 'newPerson3'])
print(f'cvalues2: \n{cvalues2}\n')

# point estimate of prediction (cvalues2):
colgpa_pred2 = results.predict(cvalues2)
print(f'colgpa_pred2: \n{colgpa_pred2}\n')


Script 6.8: Example-6-5.py 
import wooldridge as woo 
import statsmodels.formula.api as smf 
import pandas as pd 


gpa2 = woo.dataWoo('gpa2')

reg = smf.ols(formula='colgpa ~ sat + hsperc + hsize + I(hsize**2)', data=gpa2)
results = reg.fit()

# define three sets of regressor variables:
cvalues2 = pd.DataFrame({'sat': [1200, 900, 1400, ],
                         'hsperc': [30, 20, 5], 'hsize': [5, 3, 1]},
                        index=['newPerson1', 'newPerson2', 'newPerson3'])

# point estimates and 95% confidence and prediction intervals:
colgpa_PICI_95 = results.get_prediction(cvalues2).summary_frame(alpha=0.05)
print(f'colgpa_PICI_95: \n{colgpa_PICI_95}\n')

# point estimates and 99% confidence and prediction intervals:
colgpa_PICI_99 = results.get_prediction(cvalues2).summary_frame(alpha=0.01)
print(f'colgpa_PICI_99: \n{colgpa_PICI_99}\n')


Script 6.9: Effects-Manual.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt


hprice2 = woo.dataWoo('hprice2')

# repeating the regression from Example 6.2:
reg = smf.ols(
    formula='np.log(price) ~ np.log(nox)+np.log(dist)+rooms+I(rooms**2)+stratio',
    data=hprice2)
results = reg.fit()

# predictions with rooms = 4-8, all others at the sample mean:
nox_mean = np.mean(hprice2['nox'])
dist_mean = np.mean(hprice2['dist'])
stratio_mean = np.mean(hprice2['stratio'])
X = pd.DataFrame({'rooms': np.linspace(4, 8, num=5),
                  'nox': nox_mean,
                  'dist': dist_mean,
                  'stratio': stratio_mean})
print(f'X: \n{X}\n')

# calculate 95% confidence interval:
lpr_PICI = results.get_prediction(X).summary_frame(alpha=0.05)
lpr_CI = lpr_PICI[['mean', 'mean_ci_lower', 'mean_ci_upper']]


print(f'lpr_CI: \n{lpr_CI}\n')

# plot:
plt.plot(X['rooms'], lpr_CI['mean'], color='black',
         linestyle='-', label='')
plt.plot(X['rooms'], lpr_CI['mean_ci_upper'], color='lightgrey',
         linestyle='--', label='upper CI')
plt.plot(X['rooms'], lpr_CI['mean_ci_lower'], color='darkgrey',
         linestyle='--', label='lower CI')
plt.ylabel('lprice')
plt.xlabel('rooms')
plt.legend()
plt.savefig('PyGraphs/Effects-Manual.pdf')


7. Scripts Used in Chapter 07


Script 7.1: Example-7-1.py
import wooldridge as woo 
import pandas as pd 
import statsmodels.formula.api as smf 


wage1 = woo.dataWoo('wage1')

reg = smf.ols(formula='wage ~ female + educ + exper + tenure', data=wage1)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Script 7.2: Example-7-6.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


wage1 = woo.dataWoo('wage1')

reg = smf.ols(formula='np.log(wage) ~ married*female + educ + exper +'
                      'I(exper**2) + tenure + I(tenure**2)', data=wage1)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Script 7.3: Example-7-1-Boolean.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf


wage1 = woo.dataWoo('wage1')

# regression with boolean variable:
wage1['isfemale'] = (wage1['female'] == 1)
reg = smf.ols(formula='wage ~ isfemale + educ + exper + tenure', data=wage1)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Script 7.4: Regr-Categorical.py 
import pandas as pd 
import numpy as np 
import statsmodels.formula.api as smf 


CPS1985 = pd.read_csv('data/CPS1985.csv')
# rename variable to make outputs more compact:
CPS1985['oc'] = CPS1985['occupation']

# table of categories and frequencies for two categorical variables:
freq_gender = pd.crosstab(CPS1985['gender'], columns='count')
print(f'freq_gender: \n{freq_gender}\n')

freq_occupation = pd.crosstab(CPS1985['oc'], columns='count')
print(f'freq_occupation: \n{freq_occupation}\n')

# directly using categorical variables in regression formula:
reg = smf.ols(formula='np.log(wage) ~ education +'
                      'experience + C(gender) + C(oc)', data=CPS1985)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')

# rerun regression with different reference category:
reg_newref = smf.ols(formula='np.log(wage) ~ education + experience + '
                             'C(gender, Treatment("male")) + '
                             'C(oc, Treatment("technical"))', data=CPS1985)
results_newref = reg_newref.fit()

# print results:
table_newref = pd.DataFrame({'b': round(results_newref.params, 4),
                             'se': round(results_newref.bse, 4),
                             't': round(results_newref.tvalues, 4),
                             'pval': round(results_newref.pvalues, 4)})
print(f'table_newref: \n{table_newref}\n')


Script 7.5: Regr-Categorical-Anova.py
import pandas as pd 
import numpy as np 
import statsmodels.api as sm 

import statsmodels.formula.api as smf 


CPS1985 = pd.read_csv('data/CPS1985.csv')

# run regression:
reg = smf.ols(
    formula='np.log(wage) ~ education + experience + gender + occupation',
    data=CPS1985)
results = reg.fit()

# print regression table:
table_reg = pd.DataFrame({'b': round(results.params, 4),
                          'se': round(results.bse, 4),
                          't': round(results.tvalues, 4),
                          'pval': round(results.pvalues, 4)})
print(f'table_reg: \n{table_reg}\n')

# ANOVA table:
table_anova = sm.stats.anova_lm(results, typ=2)
print(f'table_anova: \n{table_anova}\n')


Script 7.6: Example-7-8.py 
import wooldridge as woo 
import numpy as np 
import pandas as pd 
import statsmodels.api as sm
import statsmodels.formula.api as smf


lawsch85 = woo.dataWoo('lawsch85')

# define cut points for the rank:
cutpts = [0, 10, 25, 40, 60, 100, 175]

# create categorical variable containing ranges for the rank:
lawsch85['rc'] = pd.cut(lawsch85['rank'], bins=cutpts,
                        labels=['(0,10]', '(10,25]', '(25,40]',
                                '(40,60]', '(60,100]', '(100,175]'])

# display frequencies:
freq = pd.crosstab(lawsch85['rc'], columns='count')
print(f'freq: \n{freq}\n')

# run regression:
reg = smf.ols(formula='np.log(salary) ~ C(rc, Treatment("(100,175]")) +'
                      'LSAT + GPA + np.log(libvol) + np.log(cost)',
              data=lawsch85)
results = reg.fit()

# print regression table:
table_reg = pd.DataFrame({'b': round(results.params, 4),
                          'se': round(results.bse, 4),
                          't': round(results.tvalues, 4),
                          'pval': round(results.pvalues, 4)})
print(f'table_reg: \n{table_reg}\n')

# ANOVA table:
table_anova = sm.stats.anova_lm(results, typ=2)
print(f'table_anova: \n{table_anova}\n')


Script 7.7: Dummy-Interact.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf


gpa3 = woo.dataWoo('gpa3')

# model with full interactions with female dummy (only for spring data):
reg = smf.ols(formula='cumgpa ~ female * (sat + hsperc + tothrs)',
              data=gpa3, subset=(gpa3['spring'] == 1))
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')

# F-Test for H0 (the interaction coefficients of 'female' are zero):
hypotheses = ['female = 0', 'female:sat = 0',
              'female:hsperc = 0', 'female:tothrs = 0']
ftest = results.f_test(hypotheses)
fstat = ftest.statistic[0][0]
fpval = ftest.pvalue
print(f'fstat: {fstat}\n')
print(f'fpval: {fpval}\n')


Script 7.8: Dummy-Interact-Sep.py 
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf


gpa3 = woo.dataWoo('gpa3')

# estimate model for males (& spring data):
reg_m = smf.ols(formula='cumgpa ~ sat + hsperc + tothrs',
                data=gpa3,
                subset=(gpa3['spring'] == 1) & (gpa3['female'] == 0))
results_m = reg_m.fit()

# print regression table:
table_m = pd.DataFrame({'b': round(results_m.params, 4),
                        'se': round(results_m.bse, 4),
                        't': round(results_m.tvalues, 4),
                        'pval': round(results_m.pvalues, 4)})
print(f'table_m: \n{table_m}\n')

# estimate model for females (& spring data):
reg_f = smf.ols(formula='cumgpa ~ sat + hsperc + tothrs',
                data=gpa3,
                subset=(gpa3['spring'] == 1) & (gpa3['female'] == 1))
results_f = reg_f.fit()

# print regression table:
table_f = pd.DataFrame({'b': round(results_f.params, 4),
                        'se': round(results_f.bse, 4),
                        't': round(results_f.tvalues, 4),
                        'pval': round(results_f.pvalues, 4)})
print(f'table_f: \n{table_f}\n')


8. Scripts Used in Chapter 08 


Script 8.1: Example-8-2.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf


gpa3 = woo.dataWoo('gpa3')

# define regression model:
reg = smf.ols(formula='cumgpa ~ sat + hsperc + tothrs + female + black + white',
              data=gpa3, subset=(gpa3['spring'] == 1))

# estimate default model (only for spring data):
results_default = reg.fit()
table_default = pd.DataFrame({'b': round(results_default.params, 5),
                              'se': round(results_default.bse, 5),
                              't': round(results_default.tvalues, 5),
                              'pval': round(results_default.pvalues, 5)})
print(f'table_default: \n{table_default}\n')

# estimate model with White SE (only for spring data):
results_white = reg.fit(cov_type='HC0')
table_white = pd.DataFrame({'b': round(results_white.params, 5),
                            'se': round(results_white.bse, 5),
                            't': round(results_white.tvalues, 5),
                            'pval': round(results_white.pvalues, 5)})
print(f'table_white: \n{table_white}\n')

# estimate model with refined White SE (only for spring data):
results_refined = reg.fit(cov_type='HC3')
table_refined = pd.DataFrame({'b': round(results_refined.params, 5),
                              'se': round(results_refined.bse, 5),
                              't': round(results_refined.tvalues, 5),
                              'pval': round(results_refined.pvalues, 5)})
print(f'table_refined: \n{table_refined}\n')


Script 8.2: Example-8-2-cont.py
import wooldridge as woo
import statsmodels.formula.api as smf


gpa3 = woo.dataWoo('gpa3')

# definition of model and hypotheses:
reg = smf.ols(formula='cumgpa ~ sat + hsperc + tothrs + female + black + white',
              data=gpa3, subset=(gpa3['spring'] == 1))
hypotheses = ['black = 0', 'white = 0']

# F-Tests using different variance-covariance formulas:
# usual VCOV:
results_default = reg.fit()
ftest_default = results_default.f_test(hypotheses)
fstat_default = ftest_default.statistic[0][0]
fpval_default = ftest_default.pvalue
print(f'fstat_default: {fstat_default}\n')
print(f'fpval_default: {fpval_default}\n')

# refined White VCOV:
results_hc3 = reg.fit(cov_type='HC3')
ftest_hc3 = results_hc3.f_test(hypotheses)
fstat_hc3 = ftest_hc3.statistic[0][0]
fpval_hc3 = ftest_hc3.pvalue
print(f'fstat_hc3: {fstat_hc3}\n')
print(f'fpval_hc3: {fpval_hc3}\n')

# classical White VCOV:
results_hc0 = reg.fit(cov_type='HC0')
ftest_hc0 = results_hc0.f_test(hypotheses)
fstat_hc0 = ftest_hc0.statistic[0][0]
fpval_hc0 = ftest_hc0.pvalue
print(f'fstat_hc0: {fstat_hc0}\n')
print(f'fpval_hc0: {fpval_hc0}\n')


Script 8.3: Example-8-4.py 
import wooldridge as woo 
import pandas as pd 
import statsmodels.api as sm 
import statsmodels.formula.api as smf 
import patsy as pt 


hprice1 = woo.dataWoo('hprice1')

# estimate model:
reg = smf.ols(formula='price ~ lotsize + sqrft + bdrms', data=hprice1)
results = reg.fit()
table_results = pd.DataFrame({'b': round(results.params, 4),
                              'se': round(results.bse, 4),
                              't': round(results.tvalues, 4),
                              'pval': round(results.pvalues, 4)})
print(f'table_results: \n{table_results}\n')

# automatic BP test (LM version):
y, X = pt.dmatrices('price ~ lotsize + sqrft + bdrms',
                    data=hprice1, return_type='dataframe')
result_bp_lm = sm.stats.diagnostic.het_breuschpagan(results.resid, X)
bp_lm_statistic = result_bp_lm[0]
bp_lm_pval = result_bp_lm[1]
print(f'bp_lm_statistic: {bp_lm_statistic}\n')
print(f'bp_lm_pval: {bp_lm_pval}\n')

# manual BP test (F version):
hprice1['resid_sq'] = results.resid ** 2
reg_resid = smf.ols(formula='resid_sq ~ lotsize + sqrft + bdrms', data=hprice1)
results_resid = reg_resid.fit()
bp_F_statistic = results_resid.fvalue
bp_F_pval = results_resid.f_pvalue
print(f'bp_F_statistic: {bp_F_statistic}\n')
print(f'bp_F_pval: {bp_F_pval}\n')


Script 8.4: Example-8-5.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy as pt


hprice1 = woo.dataWoo('hprice1')

# estimate model:
reg = smf.ols(formula='np.log(price) ~ np.log(lotsize) + np.log(sqrft) + bdrms',
              data=hprice1)
results = reg.fit()

# BP test:
y, X_bp = pt.dmatrices('np.log(price) ~ np.log(lotsize) + np.log(sqrft) + bdrms',
                       data=hprice1, return_type='dataframe')
result_bp = sm.stats.diagnostic.het_breuschpagan(results.resid, X_bp)
bp_statistic = result_bp[0]
bp_pval = result_bp[1]
print(f'bp_statistic: {bp_statistic}\n')
print(f'bp_pval: {bp_pval}\n')

# White test:
X_wh = pd.DataFrame({'const': 1, 'fitted_reg': results.fittedvalues,
                     'fitted_reg_sq': results.fittedvalues ** 2})
result_white = sm.stats.diagnostic.het_breuschpagan(results.resid, X_wh)
white_statistic = result_white[0]
white_pval = result_white[1]
print(f'white_statistic: {white_statistic}\n')
print(f'white_pval: {white_pval}\n')


Script 8.5: Example-8-6.py 
import wooldridge as woo 
import pandas as pd 
import statsmodels.formula.api as smf 


k401ksubs = woo.dataWoo('401ksubs')

# subsetting data:
k401ksubs_sub = k401ksubs[k401ksubs['fsize'] == 1]

# OLS (only for singles, i.e. 'fsize'==1):
reg_ols = smf.ols(formula='nettfa ~ inc + I((age-25)**2) + male + e401k',
                  data=k401ksubs_sub)
results_ols = reg_ols.fit(cov_type='HC0')

# print regression table:
table_ols = pd.DataFrame({'b': round(results_ols.params, 4),
                          'se': round(results_ols.bse, 4),
                          't': round(results_ols.tvalues, 4),
                          'pval': round(results_ols.pvalues, 4)})
print(f'table_ols: \n{table_ols}\n')

# WLS:
wls_weight = list(1 / k401ksubs_sub['inc'])
reg_wls = smf.wls(formula='nettfa ~ inc + I((age-25)**2) + male + e401k',
                  weights=wls_weight, data=k401ksubs_sub)
results_wls = reg_wls.fit()

# print regression table:
table_wls = pd.DataFrame({'b': round(results_wls.params, 4),
                          'se': round(results_wls.bse, 4),
                          't': round(results_wls.tvalues, 4),
                          'pval': round(results_wls.pvalues, 4)})
print(f'table_wls: \n{table_wls}\n')


Script 8.6: WLS-Robust.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf


k401ksubs = woo.dataWoo('401ksubs')

# subsetting data:
k401ksubs_sub = k401ksubs[k401ksubs['fsize'] == 1]

# WLS:
wls_weight = list(1 / k401ksubs_sub['inc'])
reg_wls = smf.wls(formula='nettfa ~ inc + I((age-25)**2) + male + e401k',
                  weights=wls_weight, data=k401ksubs_sub)

# non-robust (default) results:
results_wls = reg_wls.fit()
table_default = pd.DataFrame({'b': round(results_wls.params, 4),
                              'se': round(results_wls.bse, 4),
                              't': round(results_wls.tvalues, 4),
                              'pval': round(results_wls.pvalues, 4)})
print(f'table_default: \n{table_default}\n')

# robust results (Refined White SE):
results_white = reg_wls.fit(cov_type='HC3')
table_white = pd.DataFrame({'b': round(results_white.params, 4),
                            'se': round(results_white.bse, 4),
                            't': round(results_white.tvalues, 4),
                            'pval': round(results_white.pvalues, 4)})
print(f'table_white: \n{table_white}\n')


Script 8.7: Example-8-7.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy as pt


smoke = woo.dataWoo('smoke')

# OLS:
reg_ols = smf.ols(formula='cigs ~ np.log(income) + np.log(cigpric) +'
                          'educ + age + I(age**2) + restaurn',
                  data=smoke)
results_ols = reg_ols.fit()
table_ols = pd.DataFrame({'b': round(results_ols.params, 4),
                          'se': round(results_ols.bse, 4),
                          't': round(results_ols.tvalues, 4),
                          'pval': round(results_ols.pvalues, 4)})
print(f'table_ols: \n{table_ols}\n')

# BP test:
y, X = pt.dmatrices('cigs ~ np.log(income) + np.log(cigpric) + educ +'
                    'age + I(age**2) + restaurn',
                    data=smoke, return_type='dataframe')
result_bp = sm.stats.diagnostic.het_breuschpagan(results_ols.resid, X)
bp_statistic = result_bp[0]
bp_pval = result_bp[1]
print(f'bp_statistic: {bp_statistic}\n')
print(f'bp_pval: {bp_pval}\n')

# FGLS (estimation of the variance function):
smoke['logu2'] = np.log(results_ols.resid ** 2)
reg_fgls = smf.ols(formula='logu2 ~ np.log(income) + np.log(cigpric) +'
                           'educ + age + I(age**2) + restaurn', data=smoke)
results_fgls = reg_fgls.fit()
table_fgls = pd.DataFrame({'b': round(results_fgls.params, 4),
                           'se': round(results_fgls.bse, 4),
                           't': round(results_fgls.tvalues, 4),
                           'pval': round(results_fgls.pvalues, 4)})
print(f'table_fgls: \n{table_fgls}\n')

# FGLS (WLS):
wls_weight = list(1 / np.exp(results_fgls.fittedvalues))
reg_wls = smf.wls(formula='cigs ~ np.log(income) + np.log(cigpric) +'
                          'educ + age + I(age**2) + restaurn',
                  weights=wls_weight, data=smoke)
results_wls = reg_wls.fit()
table_wls = pd.DataFrame({'b': round(results_wls.params, 4),
                          'se': round(results_wls.bse, 4),
                          't': round(results_wls.tvalues, 4),
                          'pval': round(results_wls.pvalues, 4)})
print(f'table_wls: \n{table_wls}\n')


9. Scripts Used in Chapter 09


Script 9.1: Example-9-2-manual.py
import wooldridge as woo 

import pandas as pd 

import statsmodels.formula.api as smf 


hprice1 = woo.dataWoo('hprice1')

# original OLS:
reg = smf.ols(formula='price ~ lotsize + sqrft + bdrms', data=hprice1)
results = reg.fit()

# regression for RESET test:
hprice1['fitted_sq'] = results.fittedvalues ** 2
hprice1['fitted_cub'] = results.fittedvalues ** 3
reg_reset = smf.ols(formula='price ~ lotsize + sqrft + bdrms +'
                            ' fitted_sq + fitted_cub', data=hprice1)
results_reset = reg_reset.fit()

# print regression table:
table = pd.DataFrame({'b': round(results_reset.params, 4),
                      'se': round(results_reset.bse, 4),
                      't': round(results_reset.tvalues, 4),
                      'pval': round(results_reset.pvalues, 4)})
print(f'table: \n{table}\n')

# RESET test (H0: all coeffs including "fitted" are=0):
hypotheses = ['fitted_sq = 0', 'fitted_cub = 0']
ftest_man = results_reset.f_test(hypotheses)
fstat_man = ftest_man.statistic[0][0]
fpval_man = ftest_man.pvalue

print(f'fstat_man: {fstat_man}\n')
print(f'fpval_man: {fpval_man}\n')


Script 9.2: Example-9-2-automatic.py 
import wooldridge as woo 
import statsmodels.formula.api as smf 
import statsmodels.stats.outliers_influence as smo


hprice1 = woo.dataWoo('hprice1')

# original linear regression:
reg = smf.ols(formula='price ~ lotsize + sqrft + bdrms', data=hprice1)
results = reg.fit()

# automated RESET test:
reset_output = smo.reset_ramsey(res=results, degree=3)
fstat_auto = reset_output.statistic[0][0]
fpval_auto = reset_output.pvalue

print(f'fstat_auto: {fstat_auto}\n')
print(f'fpval_auto: {fpval_auto}\n')


Script 9.3: Nonnested-Test.py
import wooldridge as woo
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf


hprice1 = woo.dataWoo('hprice1')

# two alternative models:
reg1 = smf.ols(formula='price ~ lotsize + sqrft + bdrms', data=hprice1)
results1 = reg1.fit()

reg2 = smf.ols(formula='price ~ np.log(lotsize) +'
                       'np.log(sqrft) + bdrms', data=hprice1)
results2 = reg2.fit()

# encompassing test of Davidson & MacKinnon:
# comprehensive model:
reg3 = smf.ols(formula='price ~ lotsize + sqrft + bdrms + '
                       'np.log(lotsize) + np.log(sqrft)', data=hprice1)
results3 = reg3.fit()

# model 1 vs. comprehensive model:
anovaResults1 = sm.stats.anova_lm(results1, results3)
print(f'anovaResults1: \n{anovaResults1}\n')

# model 2 vs. comprehensive model:
anovaResults2 = sm.stats.anova_lm(results2, results3)
print(f'anovaResults2: \n{anovaResults2}\n')


Script 9.4: Sim-ME-Dep.py 
import numpy as np 

import scipy.stats as stats 

import pandas as pd 

import statsmodels.formula.api as smf 


# set the random seed: 
np.random.seed(1234567)

# set sample size and number of simulations:
n = 1000
r = 10000

# set true parameters (betas):
beta0 = 1
beta1 = 0.5

# initialize arrays to store results later (b1 without ME, b1_me with ME):
b1 = np.empty(r)
b1_me = np.empty(r)

# draw a sample of x, fixed over replications:
x = stats.norm.rvs(4, 1, size=n)

# repeat r times:
for i in range(r):
    # draw a sample of u:
    u = stats.norm.rvs(0, 1, size=n)

    # draw a sample of ystar:
    ystar = beta0 + beta1 * x + u

    # measurement error and mismeasured y:
    e0 = stats.norm.rvs(0, 1, size=n)
    y = ystar + e0
    df = pd.DataFrame({'ystar': ystar, 'y': y, 'x': x})

    # regress ystar on x and store slope estimate at position i:
    reg_star = smf.ols(formula='ystar ~ x', data=df)
    results_star = reg_star.fit()
    b1[i] = results_star.params['x']

    # regress y on x and store slope estimate at position i:
    reg_me = smf.ols(formula='y ~ x', data=df)
    results_me = reg_me.fit()
    b1_me[i] = results_me.params['x']

# mean with and without ME:
b1_mean = np.mean(b1)
b1_me_mean = np.mean(b1_me)
print(f'b1_mean: {b1_mean}\n')
print(f'b1_me_mean: {b1_me_mean}\n')

# variance with and without ME:
b1_var = np.var(b1, ddof=1)
b1_me_var = np.var(b1_me, ddof=1)
print(f'b1_var: {b1_var}\n')
print(f'b1_me_var: {b1_me_var}\n')


Script 9.5: Sim-ME-Explan.py 
import numpy as np 
import scipy.stats as stats 
import pandas as pd 
import statsmodels.formula.api as smf 


# set the random seed: 
np.random.seed(1234567) 


# set sample size and number of simulations:
n = 1000
r = 10000

# set true parameters (betas):
beta0 = 1
beta1 = 0.5

# initialize b1 arrays to store results later:
b1 = np.empty(r)
b1_me = np.empty(r)

# draw a sample of x, fixed over replications:
xstar = stats.norm.rvs(4, 1, size=n)

# repeat r times:
for i in range(r):
    # draw a sample of u:
    u = stats.norm.rvs(0, 1, size=n)

    # draw a sample of y:
    y = beta0 + beta1 * xstar + u

    # measurement error and mismeasured x:
    e1 = stats.norm.rvs(0, 1, size=n)
    x = xstar + e1
    df = pd.DataFrame({'y': y, 'xstar': xstar, 'x': x})

    # regress y on xstar and store slope estimate at position i:
    reg_star = smf.ols(formula='y ~ xstar', data=df)
    results_star = reg_star.fit()
    b1[i] = results_star.params['xstar']

    # regress y on x and store slope estimate at position i:
    reg_me = smf.ols(formula='y ~ x', data=df)
    results_me = reg_me.fit()
    b1_me[i] = results_me.params['x']

# mean with and without ME:
b1_mean = np.mean(b1)
b1_me_mean = np.mean(b1_me)
print(f'b1_mean: {b1_mean}\n')
print(f'b1_me_mean: {b1_me_mean}\n')

# variance with and without ME:
b1_var = np.var(b1, ddof=1)
b1_me_var = np.var(b1_me, ddof=1)
print(f'b1_var: {b1_var}\n')
print(f'b1_me_var: {b1_me_var}\n')
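With classical measurement error in the regressor, the slope estimate is pulled towards zero by the factor Var(xstar) / (Var(xstar) + Var(e1)). In the design of this simulation both variances equal 1, so the slope on the mismeasured x should be close to 0.5 * 1 / (1 + 1) = 0.25. The following sketch checks this with a single large sample and only the standard library. The helper ols_slope, the sample size, and the use of Python's own random number generator are assumptions made for illustration.

```python
import random
import statistics


def ols_slope(x, y):
    """Simple-regression slope: sum of cross products / sum of squares of x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


random.seed(1234567)
n = 20000                 # one large (hypothetical) sample instead of r replications
beta0, beta1 = 1, 0.5

# true regressor, error term, and measurement error (all with variance 1):
xstar = [random.gauss(4, 1) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]
e1 = [random.gauss(0, 1) for _ in range(n)]

y = [beta0 + beta1 * xs + ui for xs, ui in zip(xstar, u)]
x = [xs + ei for xs, ei in zip(xstar, e1)]

# slope without and with measurement error:
b1 = ols_slope(xstar, y)
b1_me = ols_slope(x, y)

# attenuation factor implied by the variances of xstar and e1:
b1_me_plim = beta1 * 1 / (1 + 1)

print(f'b1: {b1}\n')
print(f'b1_me: {b1_me}\n')
print(f'b1_me_plim: {b1_me_plim}\n')
```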


Script 9.6: NA-NaN-Inf.py
import numpy as np
import pandas as pd
import scipy.stats as stats


# nan and inf handling in numpy:
x = np.array([-1, 0, 1, np.nan, np.inf, -np.inf])
logx = np.log(x)
invx = np.array(1 / x)
ncdf = np.array(stats.norm.cdf(x))
isnanx = np.isnan(x)

results = pd.DataFrame({'x': x, 'logx': logx, 'invx': invx,
                        'ncdf': ncdf, 'isnanx': isnanx})
print(f'results: \n{results}\n')


Script 9.7: Missings.py
import wooldridge as woo
import pandas as pd

lawsch85 = woo.dataWoo('lawsch85')
lsat_pd = lawsch85['LSAT']

# create boolean indicator for missings:
missLSAT = lsat_pd.isna()

# LSAT and indicator for Schools No. 120-129:
preview = pd.DataFrame({'lsat_pd': lsat_pd[119:129],
                        'missLSAT': missLSAT[119:129]})
print(f'preview: \n{preview}\n')

# frequencies of indicator:
freq_missLSAT = pd.crosstab(missLSAT, columns='count')
print(f'freq_missLSAT: \n{freq_missLSAT}\n')

# missings for all variables in data frame (counts):
miss_all = lawsch85.isna()
colsums = miss_all.sum(axis=0)
print(f'colsums: \n{colsums}\n')

# computing amount of complete cases:
complete_cases = (miss_all.sum(axis=1) == 0)
freq_complete_cases = pd.crosstab(complete_cases, columns='count')
print(f'freq_complete_cases: \n{freq_complete_cases}\n')


Script 9.8: Missings-Analyses.py
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

lawsch85 = woo.dataWoo('lawsch85')

# missings in numpy:
x_np = np.array(lawsch85['LSAT'])
x_np_bar1 = np.mean(x_np)
x_np_bar2 = np.nanmean(x_np)
print(f'x_np_bar1: {x_np_bar1}\n')
print(f'x_np_bar2: {x_np_bar2}\n')

# missings in pandas:
x_pd = lawsch85['LSAT']
x_pd_bar1 = np.mean(x_pd)
x_pd_bar2 = np.nanmean(x_pd)
print(f'x_pd_bar1: {x_pd_bar1}\n')
print(f'x_pd_bar2: {x_pd_bar2}\n')

# observations and variables:
print(f'lawsch85.shape: {lawsch85.shape}\n')

# regression (missings are taken care of by default):
reg = smf.ols(formula='np.log(salary) ~ LSAT + cost + age', data=lawsch85)
results = reg.fit()
print(f'results.nobs: {results.nobs}\n')
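
The difference between a mean that propagates missing values and one that skips them can also be seen without any data set. The following is only a minimal sketch using the Python standard library; the helper names mean_all and mean_skip_nan and the four scores are made up for illustration and are not part of the book's scripts.

```python
import math


def mean_all(values):
    """Plain arithmetic mean: a single NaN turns the result into NaN."""
    return sum(values) / len(values)


def mean_skip_nan(values):
    """Mean over the non-missing entries only."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept)


# hypothetical scores with one missing value:
scores = [155.0, 160.0, math.nan, 165.0]

bar1 = mean_all(scores)
bar2 = mean_skip_nan(scores)
print(f'bar1: {bar1}\n')
print(f'bar2: {bar2}\n')
```

Here bar1 plays the role of np.mean applied to an array with missings and bar2 the role of np.nanmean.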


Script 9.9: Outliers.py
import wooldridge as woo
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

rdchem = woo.dataWoo('rdchem')

# OLS regression:
reg = smf.ols(formula='rdintens ~ sales + profmarg', data=rdchem)
results = reg.fit()

# studentized residuals for all observations:
studres = results.get_influence().resid_studentized_external

# display extreme values:
studres_max = np.max(studres)
studres_min = np.min(studres)
print(f'studres_max: {studres_max}\n')
print(f'studres_min: {studres_min}\n')

# histogram (and overlayed density plot):
kde = sm.nonparametric.KDEUnivariate(studres)
kde.fit()

plt.hist(studres, color='grey', density=True)
plt.plot(kde.support, kde.density, color='black', linewidth=2)
plt.ylabel('density')
plt.xlabel('studres')
plt.savefig('PyGraphs/Outliers.pdf')


Script 9.10: LAD.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

rdchem = woo.dataWoo('rdchem')

# OLS regression:
reg_ols = smf.ols(formula='rdintens ~ I(sales/1000) + profmarg', data=rdchem)
results_ols = reg_ols.fit()

table_ols = pd.DataFrame({'b': round(results_ols.params, 4),
                          'se': round(results_ols.bse, 4),
                          't': round(results_ols.tvalues, 4),
                          'pval': round(results_ols.pvalues, 4)})
print(f'table_ols: \n{table_ols}\n')

# LAD regression:
reg_lad = smf.quantreg(formula='rdintens ~ I(sales/1000) + profmarg', data=rdchem)
results_lad = reg_lad.fit(q=.5)

table_lad = pd.DataFrame({'b': round(results_lad.params, 4),
                          'se': round(results_lad.bse, 4),
                          't': round(results_lad.tvalues, 4),
                          'pval': round(results_lad.pvalues, 4)})
print(f'table_lad: \n{table_lad}\n')


Script 10.1: Example-10-2.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

intdef = woo.dataWoo('intdef')

# linear regression of static model (Q function avoids conflicts with keywords):
reg = smf.ols(formula='i3 ~ Q("inf") + Q("def")', data=intdef)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Script 10.2: Example-Barium.py
import wooldridge as woo
import pandas as pd
import matplotlib.pyplot as plt

barium = woo.dataWoo('barium')
T = len(barium)

# monthly time series starting Feb. 1978:
barium.index = pd.date_range(start='1978-02', periods=T, freq='M')
print(f'barium["chnimp"].head(): \n{barium["chnimp"].head()}\n')

# plot chnimp (default: index on the x-axis):
plt.plot('chnimp', data=barium, color='black', linestyle='-')
plt.ylabel('chnimp')
plt.xlabel('time')
plt.savefig('PyGraphs/Example-Barium.pdf')


Script 10.3: Example-StockData.py
import pandas_datareader as pdr
import matplotlib.pyplot as plt

# download data for 'F' (= Ford Motor Company) and define start and end:
tickers = ['F']
start_date = '2014-01-01'
end_date = '2015-12-31'

# use pandas_datareader for the import:
F_data = pdr.data.DataReader(tickers, 'yahoo', start_date, end_date)

# look at imported data:
print(f'F_data.head(): \n{F_data.head()}\n')
print(f'F_data.tail(): \n{F_data.tail()}\n')

# time series plot of adjusted closing price:
plt.plot('Close', data=F_data, color='black', linestyle='-')
plt.ylabel('Ford Close Price')
plt.xlabel('time')
plt.savefig('PyGraphs/Example-StockData.pdf')


Script 10.4: Example-10-4.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

fertil3 = woo.dataWoo('fertil3')
T = len(fertil3)

# define yearly time series beginning in 1913:
fertil3.index = pd.date_range(start='1913', periods=T, freq='Y').year

# add all lags of 'pe' up to order 2:
fertil3['pe_lag1'] = fertil3['pe'].shift(1)
fertil3['pe_lag2'] = fertil3['pe'].shift(2)

# linear regression of model with lags:
reg = smf.ols(formula='gfr ~ pe + pe_lag1 + pe_lag2 + ww2 + pill', data=fertil3)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Script 10.5: Example-10-4-cont.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

fertil3 = woo.dataWoo('fertil3')
T = len(fertil3)

# define yearly time series beginning in 1913:
fertil3.index = pd.date_range(start='1913', periods=T, freq='Y').year

# add all lags of 'pe' up to order 2:
fertil3['pe_lag1'] = fertil3['pe'].shift(1)
fertil3['pe_lag2'] = fertil3['pe'].shift(2)

# linear regression of model with lags:
reg = smf.ols(formula='gfr ~ pe + pe_lag1 + pe_lag2 + ww2 + pill', data=fertil3)
results = reg.fit()

# F test (H0: all pe coefficients are=0):
hypotheses1 = ['pe = 0', 'pe_lag1 = 0', 'pe_lag2 = 0']
ftest1 = results.f_test(hypotheses1)
fstat1 = ftest1.statistic[0][0]
fpval1 = ftest1.pvalue

print(f'fstat1: {fstat1}\n')
print(f'fpval1: {fpval1}\n')

# calculating the LRP:
b = results.params
b_pe_tot = b['pe'] + b['pe_lag1'] + b['pe_lag2']
print(f'b_pe_tot: {b_pe_tot}\n')

# F test (H0: LRP=0):
hypotheses2 = ['pe + pe_lag1 + pe_lag2 = 0']
ftest2 = results.f_test(hypotheses2)
fstat2 = ftest2.statistic[0][0]
fpval2 = ftest2.pvalue

print(f'fstat2: {fstat2}\n')
print(f'fpval2: {fpval2}\n')


Script 10.6: Example-10-7.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

hseinv = woo.dataWoo('hseinv')

# linear regression without time trend:
reg_wot = smf.ols(formula='np.log(invpc) ~ np.log(price)', data=hseinv)
results_wot = reg_wot.fit()

# print regression table:
table_wot = pd.DataFrame({'b': round(results_wot.params, 4),
                          'se': round(results_wot.bse, 4),
                          't': round(results_wot.tvalues, 4),
                          'pval': round(results_wot.pvalues, 4)})
print(f'table_wot: \n{table_wot}\n')

# linear regression with time trend (data set includes a time variable t):
reg_wt = smf.ols(formula='np.log(invpc) ~ np.log(price) + t', data=hseinv)
results_wt = reg_wt.fit()

# print regression table:
table_wt = pd.DataFrame({'b': round(results_wt.params, 4),
                         'se': round(results_wt.bse, 4),
                         't': round(results_wt.tvalues, 4),
                         'pval': round(results_wt.pvalues, 4)})
print(f'table_wt: \n{table_wt}\n')


Script 10.7: Example-10-11.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

barium = woo.dataWoo('barium')

# linear regression with seasonal effects:
reg = smf.ols(formula='np.log(chnimp) ~ np.log(chempi) + np.log(gas) +'
                      'np.log(rtwex) + befile6 + affile6 + afdec6 +'
                      'feb + mar + apr + may + jun + jul +'
                      'aug + sep + oct + nov + dec',
              data=barium)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


11. Scripts Used in Chapter 11 


Script 11.1: Example-11-4.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

nyse = woo.dataWoo('nyse')
nyse['ret'] = nyse['return']

# add all lags up to order 3:
nyse['ret_lag1'] = nyse['ret'].shift(1)
nyse['ret_lag2'] = nyse['ret'].shift(2)
nyse['ret_lag3'] = nyse['ret'].shift(3)

# linear regression of model with lags:
reg1 = smf.ols(formula='ret ~ ret_lag1', data=nyse)
reg2 = smf.ols(formula='ret ~ ret_lag1 + ret_lag2', data=nyse)
reg3 = smf.ols(formula='ret ~ ret_lag1 + ret_lag2 + ret_lag3', data=nyse)
results1 = reg1.fit()
results2 = reg2.fit()
results3 = reg3.fit()

# print regression tables:
table1 = pd.DataFrame({'b': round(results1.params, 4),
                       'se': round(results1.bse, 4),
                       't': round(results1.tvalues, 4),
                       'pval': round(results1.pvalues, 4)})
print(f'table1: \n{table1}\n')

table2 = pd.DataFrame({'b': round(results2.params, 4),
                       'se': round(results2.bse, 4),
                       't': round(results2.tvalues, 4),
                       'pval': round(results2.pvalues, 4)})
print(f'table2: \n{table2}\n')

table3 = pd.DataFrame({'b': round(results3.params, 4),
                       'se': round(results3.bse, 4),
                       't': round(results3.tvalues, 4),
                       'pval': round(results3.pvalues, 4)})
print(f'table3: \n{table3}\n')


Script 11.2: Example-EffMkts.py 
import numpy as np 
import pandas as pd 
import pandas_datareader as pdr
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# download data for 'AAPL' (= Apple) and define start and end:
tickers = ['AAPL']
start_date = '2007-12-31'
end_date = '2016-12-31'

# use pandas_datareader for the import:
AAPL_data = pdr.data.DataReader(tickers, 'yahoo', start_date, end_date)

# drop ticker symbol from column name:
AAPL_data.columns = AAPL_data.columns.droplevel(level=1)

# calculate return as the log difference:
AAPL_data['ret'] = np.log(AAPL_data['Adj Close']).diff()

# time series plot of adjusted closing prices:
plt.plot('ret', data=AAPL_data, color='black', linestyle='-')
plt.ylabel('Apple Log Returns')
plt.xlabel('time')
plt.savefig('PyGraphs/Example-EffMkts.pdf')


# linear regression of models with lags:
AAPL_data['ret_lag1'] = AAPL_data['ret'].shift(1)
AAPL_data['ret_lag2'] = AAPL_data['ret'].shift(2)
AAPL_data['ret_lag3'] = AAPL_data['ret'].shift(3)

reg1 = smf.ols(formula='ret ~ ret_lag1', data=AAPL_data)
reg2 = smf.ols(formula='ret ~ ret_lag1 + ret_lag2', data=AAPL_data)
reg3 = smf.ols(formula='ret ~ ret_lag1 + ret_lag2 + ret_lag3', data=AAPL_data)
results1 = reg1.fit()
results2 = reg2.fit()
results3 = reg3.fit()

# print regression tables:
table1 = pd.DataFrame({'b': round(results1.params, 4),
                       'se': round(results1.bse, 4),
                       't': round(results1.tvalues, 4),
                       'pval': round(results1.pvalues, 4)})
print(f'table1: \n{table1}\n')

table2 = pd.DataFrame({'b': round(results2.params, 4),
                       'se': round(results2.bse, 4),
                       't': round(results2.tvalues, 4),
                       'pval': round(results2.pvalues, 4)})
print(f'table2: \n{table2}\n')

table3 = pd.DataFrame({'b': round(results3.params, 4),
                       'se': round(results3.bse, 4),
                       't': round(results3.tvalues, 4),
                       'pval': round(results3.pvalues, 4)})
print(f'table3: \n{table3}\n')


Script 11.3: Simulate-RandomWalk.py
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# set the random seed:
np.random.seed(1234567)

# initialize plot:
x_range = np.linspace(0, 50, num=51)
plt.ylim([-18, 18])
plt.xlim([0, 50])

# loop over draws:
for r in range(0, 30):
    # i.i.d. standard normal shock:
    e = stats.norm.rvs(0, 1, size=51)

    # set first entry to 0 (gives y_0 = 0):
    e[0] = 0

    # random walk as cumulative sum of shocks:
    y = np.cumsum(e)

    # add line to graph:
    plt.plot(x_range, y, color='lightgrey', linestyle='-')

plt.axhline(linewidth=2, linestyle='--', color='black')
plt.ylabel('y')
plt.xlabel('time')
plt.savefig('PyGraphs/Simulate-RandomWalk.pdf')


Script 11.4: Simulate-RandomWalkDrift.py
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# set the random seed:
np.random.seed(1234567)

# initialize plot:
x_range = np.linspace(0, 50, num=51)
plt.ylim([0, 100])
plt.xlim([0, 50])

# loop over draws:
for r in range(0, 30):
    # i.i.d. standard normal shock:
    e = stats.norm.rvs(0, 1, size=51)

    # set first entry to 0 (gives y_0 = 0):
    e[0] = 0

    # random walk as cumulative sum of shocks plus drift:
    y = np.cumsum(e) + 2 * x_range

    # add line to graph:
    plt.plot(x_range, y, color='lightgrey', linestyle='-')

plt.plot(x_range, 2 * x_range, linewidth=2, linestyle='--', color='black')
plt.ylabel('y')
plt.xlabel('time')
plt.savefig('PyGraphs/Simulate-RandomWalkDrift.pdf')


Script 11.5: Simulate-RandomWalkDrift-Diff.py 


import numpy as np 
import scipy.stats as stats 
import matplotlib.pyplot as plt 


# set the random seed:
np.random.seed(1234567)

# initialize plot:
x_range = np.linspace(1, 50, num=50)
plt.ylim([-1, 5])
plt.xlim([0, 50])

# loop over draws:
for r in range(0, 30):
    # i.i.d. standard normal shock and cumulative sum of shocks:
    e = stats.norm.rvs(0, 1, size=51)
    e[0] = 0
    y = np.cumsum(2 + e)

    # first difference:
    Dy = y[1:51] - y[0:50]

    # add line to graph:
    plt.plot(x_range, Dy, color='lightgrey', linestyle='-')

plt.axhline(y=2, linewidth=2, linestyle='--', color='black')
plt.ylabel('y')
plt.xlabel('time')
plt.savefig('PyGraphs/Simulate-RandomWalkDrift-Diff.pdf')
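
Why the differenced series fluctuates around the drift can be checked numerically: for a random walk with drift, y_t = drift + y_{t-1} + e_t, so the first difference equals drift + e_t. The following is only a minimal sketch using the Python standard library instead of numpy and scipy; it starts at y_0 = 0, and the names drift and max_gap are chosen here for illustration.

```python
import random
from itertools import accumulate

# set the random seed:
random.seed(1234567)

drift = 2.0
T = 50

# i.i.d. standard normal shocks e_1, ..., e_T:
e = [random.gauss(0, 1) for _ in range(T)]

# random walk with drift: y_0 = 0 and y_t = drift + y_{t-1} + e_t:
y = [0.0] + list(accumulate(drift + e_t for e_t in e))

# first difference Dy_t = y_t - y_{t-1}:
Dy = [y[t] - y[t - 1] for t in range(1, T + 1)]

# differencing recovers drift + e_t (up to floating point error):
max_gap = max(abs(Dy[t] - (drift + e[t])) for t in range(T))
print(f'max_gap: {max_gap}\n')
```

The printed gap is of the order of floating point rounding error, so the differenced series is just the drift plus the i.i.d. shock.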


Script 11.6: Example-11-6.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

fertil3 = woo.dataWoo('fertil3')
T = len(fertil3)

# define time series (years only) beginning in 1913:
fertil3.index = pd.date_range(start='1913', periods=T, freq='Y').year

# compute first differences:
fertil3['gfr_diff1'] = fertil3['gfr'].diff()
fertil3['pe_diff1'] = fertil3['pe'].diff()
print(f'fertil3.head(): \n{fertil3.head()}\n')

# linear regression of model with first differences:
reg1 = smf.ols(formula='gfr_diff1 ~ pe_diff1', data=fertil3)
results1 = reg1.fit()

# print regression table:
table1 = pd.DataFrame({'b': round(results1.params, 4),
                       'se': round(results1.bse, 4),
                       't': round(results1.tvalues, 4),
                       'pval': round(results1.pvalues, 4)})
print(f'table1: \n{table1}\n')

# linear regression of model with lagged differences:
fertil3['pe_diff1_lag1'] = fertil3['pe_diff1'].shift(1)
fertil3['pe_diff1_lag2'] = fertil3['pe_diff1'].shift(2)

reg2 = smf.ols(formula='gfr_diff1 ~ pe_diff1 + pe_diff1_lag1 + pe_diff1_lag2',
               data=fertil3)
results2 = reg2.fit()

# print regression table:
table2 = pd.DataFrame({'b': round(results2.params, 4),
                       'se': round(results2.bse, 4),
                       't': round(results2.tvalues, 4),
                       'pval': round(results2.pvalues, 4)})

print(f'table2: \n{table2}\n')


Script 12.1: Example-12-2-Static.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

phillips = woo.dataWoo('phillips')
T = len(phillips)

# define yearly time series beginning in 1948:
date_range = pd.date_range(start='1948', periods=T, freq='Y')
phillips.index = date_range.year

# estimation of static Phillips curve:
yt96 = (phillips['year'] <= 1996)
reg_s = smf.ols(formula='Q("inf") ~ unem', data=phillips, subset=yt96)
results_s = reg_s.fit()

# residuals and AR(1) test:
phillips['resid_s'] = results_s.resid
phillips['resid_s_lag1'] = phillips['resid_s'].shift(1)
reg = smf.ols(formula='resid_s ~ resid_s_lag1', data=phillips, subset=yt96)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Script 12.2: Example-12-2-ExpAug.py 
import wooldridge as woo 
import pandas as pd 
import statsmodels.formula.api as smf 


phillips = woo.dataWoo('phillips')
T = len(phillips)

# define yearly time series beginning in 1948:
date_range = pd.date_range(start='1948', periods=T, freq='Y')
phillips.index = date_range.year

# estimation of expectations-augmented Phillips curve:
yt96 = (phillips['year'] <= 1996)
phillips['inf_diff1'] = phillips['inf'].diff()
reg_ea = smf.ols(formula='inf_diff1 ~ unem', data=phillips, subset=yt96)
results_ea = reg_ea.fit()

phillips['resid_ea'] = results_ea.resid
phillips['resid_ea_lag1'] = phillips['resid_ea'].shift(1)
reg = smf.ols(formula='resid_ea ~ resid_ea_lag1', data=phillips, subset=yt96)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Script 12.3: Example-12-4.py
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

barium = woo.dataWoo('barium')
T = len(barium)

# monthly time series starting Feb. 1978:
barium.index = pd.date_range(start='1978-02', periods=T, freq='M')

reg = smf.ols(formula='np.log(chnimp) ~ np.log(chempi) + np.log(gas) +'
                      'np.log(rtwex) + befile6 + affile6 + afdec6',
              data=barium)
results = reg.fit()

# automatic test:
bg_result = sm.stats.diagnostic.acorr_breusch_godfrey(results, nlags=3)
fstat_auto = bg_result[2]
fpval_auto = bg_result[3]
print(f'fstat_auto: {fstat_auto}\n')
print(f'fpval_auto: {fpval_auto}\n')

# pedestrian test:
barium['resid'] = results.resid
barium['resid_lag1'] = barium['resid'].shift(1)
barium['resid_lag2'] = barium['resid'].shift(2)
barium['resid_lag3'] = barium['resid'].shift(3)

reg_manual = smf.ols(formula='resid ~ resid_lag1 + resid_lag2 + resid_lag3 +'
                             'np.log(chempi) + np.log(gas) + np.log(rtwex) +'
                             'befile6 + affile6 + afdec6', data=barium)
results_manual = reg_manual.fit()

# joint F test of the three lagged residuals:
hypotheses = ['resid_lag1 = 0', 'resid_lag2 = 0', 'resid_lag3 = 0']
ftest_manual = results_manual.f_test(hypotheses)
fstat_manual = ftest_manual.statistic[0][0]
fpval_manual = ftest_manual.pvalue
print(f'fstat_manual: {fstat_manual}\n')
print(f'fpval_manual: {fpval_manual}\n')


Script 12.4: Example-DWtest.py
import wooldridge as woo
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

phillips = woo.dataWoo('phillips')
T = len(phillips)

# define yearly time series beginning in 1948:
date_range = pd.date_range(start='1948', periods=T, freq='Y')
phillips.index = date_range.year

# estimation of both Phillips curve models:
yt96 = (phillips['year'] <= 1996)
phillips['inf_diff1'] = phillips['inf'].diff()
reg_s = smf.ols(formula='Q("inf") ~ unem', data=phillips, subset=yt96)
reg_ea = smf.ols(formula='inf_diff1 ~ unem', data=phillips, subset=yt96)
results_s = reg_s.fit()
results_ea = reg_ea.fit()

# DW tests:
DW_s = sm.stats.stattools.durbin_watson(results_s.resid)
DW_ea = sm.stats.stattools.durbin_watson(results_ea.resid)
print(f'DW_s: {DW_s}\n')
print(f'DW_ea: {DW_ea}\n')


Script 12.5: Example-12-5.py
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.api as sm
import patsy as pt

barium = woo.dataWoo('barium')
T = len(barium)

# monthly time series starting Feb. 1978:
barium.index = pd.date_range(start='1978-02', periods=T, freq='M')

# perform the Cochrane-Orcutt estimation (iterative procedure):
y, X = pt.dmatrices('np.log(chnimp) ~ np.log(chempi) + np.log(gas) +'
                    'np.log(rtwex) + befile6 + affile6 + afdec6',
                    data=barium, return_type='dataframe')
reg = sm.GLSAR(y, X)
CORC_results = reg.iterative_fit(maxiter=100)
table = pd.DataFrame({'b_CORC': CORC_results.params,
                      'se_CORC': CORC_results.bse})
print(f'reg.rho: {reg.rho}\n')
print(f'table: \n{table}\n')


Script 12.6: Example-12-1.py
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

prminwge = woo.dataWoo('prminwge')
T = len(prminwge)
prminwge['time'] = prminwge['year'] - 1949
prminwge.index = pd.date_range(start='1950', periods=T, freq='Y').year

# OLS regression:
reg = smf.ols(formula='np.log(prepop) ~ np.log(mincov) + np.log(prgnp) +'
                      'np.log(usgnp) + time', data=prminwge)

# results with regular SE:
results_regu = reg.fit()

# print regression table:
table_regu = pd.DataFrame({'b': round(results_regu.params, 4),
                           'se': round(results_regu.bse, 4),
                           't': round(results_regu.tvalues, 4),
                           'pval': round(results_regu.pvalues, 4)})
print(f'table_regu: \n{table_regu}\n')

# results with HAC SE:
results_hac = reg.fit(cov_type='HAC', cov_kwds={'maxlags': 2})

# print regression table:
table_hac = pd.DataFrame({'b': round(results_hac.params, 4),
                          'se': round(results_hac.bse, 4),
                          't': round(results_hac.tvalues, 4),
                          'pval': round(results_hac.pvalues, 4)})
print(f'table_hac: \n{table_hac}\n')


Script 12.7: Example-12-9.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

nyse = woo.dataWoo('nyse')
nyse['ret'] = nyse['return']
nyse['ret_lag1'] = nyse['ret'].shift(1)

# linear regression of model:
reg = smf.ols(formula='ret ~ ret_lag1', data=nyse)
results = reg.fit()

# squared residuals:
nyse['resid_sq'] = results.resid ** 2
nyse['resid_sq_lag1'] = nyse['resid_sq'].shift(1)

# model for squared residuals:
ARCHreg = smf.ols(formula='resid_sq ~ resid_sq_lag1', data=nyse)
results_ARCH = ARCHreg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results_ARCH.params, 4),
                      'se': round(results_ARCH.bse, 4),
                      't': round(results_ARCH.tvalues, 4),
                      'pval': round(results_ARCH.pvalues, 4)})
print(f'table: \n{table}\n')


Script 12.8: Example-ARCH.py 
import numpy as np 


import pandas as pd 
import pandas_datareader as pdr
import statsmodels.formula.api as smf

# download data for 'AAPL' (= Apple) and define start and end:
tickers = ['AAPL']
start_date = '2007-12-31'
end_date = '2016-12-31'

# use pandas_datareader for the import:
AAPL_data = pdr.data.DataReader(tickers, 'yahoo', start_date, end_date)

# drop ticker symbol from column name:
AAPL_data.columns = AAPL_data.columns.droplevel(level=1)

# calculate return as the difference of logged prices:
AAPL_data['ret'] = np.log(AAPL_data['Adj Close']).diff()
AAPL_data['ret_lag1'] = AAPL_data['ret'].shift(1)

# AR(1) model for returns:
reg = smf.ols(formula='ret ~ ret_lag1', data=AAPL_data)
results = reg.fit()

# squared residuals:
AAPL_data['resid_sq'] = results.resid ** 2
AAPL_data['resid_sq_lag1'] = AAPL_data['resid_sq'].shift(1)

# model for squared residuals:
ARCHreg = smf.ols(formula='resid_sq ~ resid_sq_lag1', data=AAPL_data)
results_ARCH = ARCHreg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results_ARCH.params, 4),
                      'se': round(results_ARCH.bse, 4),
                      't': round(results_ARCH.tvalues, 4),
                      'pval': round(results_ARCH.pvalues, 4)})
print(f'table: \n{table}\n')


Script 13.1: Example-13-2.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

cps78_85 = woo.dataWoo('cps78_85')

# OLS results including interaction terms:
reg = smf.ols(formula='lwage ~ y85*(educ+female) + exper +'
                      'I((exper**2)/100) + union',
              data=cps78_85)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Script 13.2: Example-13-3-1.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

kielmc = woo.dataWoo('kielmc')

# separate regressions for 1978 and 1981:
y78 = (kielmc['year'] == 1978)
reg78 = smf.ols(formula='rprice ~ nearinc', data=kielmc, subset=y78)
results78 = reg78.fit()

y81 = (kielmc['year'] == 1981)
reg81 = smf.ols(formula='rprice ~ nearinc', data=kielmc, subset=y81)
results81 = reg81.fit()

# joint regression including an interaction term:
reg_joint = smf.ols(formula='rprice ~ nearinc * C(year)', data=kielmc)
results_joint = reg_joint.fit()

# print regression tables:
table_78 = pd.DataFrame({'b': round(results78.params, 4),
                         'se': round(results78.bse, 4),
                         't': round(results78.tvalues, 4),
                         'pval': round(results78.pvalues, 4)})
print(f'table_78: \n{table_78}\n')

table_81 = pd.DataFrame({'b': round(results81.params, 4),
                         'se': round(results81.bse, 4),
                         't': round(results81.tvalues, 4),
                         'pval': round(results81.pvalues, 4)})
print(f'table_81: \n{table_81}\n')

table_joint = pd.DataFrame({'b': round(results_joint.params, 4),
                            'se': round(results_joint.bse, 4),
                            't': round(results_joint.tvalues, 4),
                            'pval': round(results_joint.pvalues, 4)})
print(f'table_joint: \n{table_joint}\n')


Script 13.3: Example-13-3-2.py 
import wooldridge as woo 
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

kielmc = woo.dataWoo('kielmc')

# difference in difference (DiD):
reg_did = smf.ols(formula='np.log(rprice) ~ nearinc*C(year)', data=kielmc)
results_did = reg_did.fit()

# print regression table:
table_did = pd.DataFrame({'b': round(results_did.params, 4),
                          'se': round(results_did.bse, 4),
                          't': round(results_did.tvalues, 4),
                          'pval': round(results_did.pvalues, 4)})
print(f'table_did: \n{table_did}\n')

# DiD with control variables:
reg_didC = smf.ols(formula='np.log(rprice) ~ nearinc*C(year) + age +'
                           'I(age**2) + np.log(intst) + np.log(land) +'
                           'np.log(area) + rooms + baths',
                   data=kielmc)
results_didC = reg_didC.fit()

# print regression table:
table_didC = pd.DataFrame({'b': round(results_didC.params, 4),
                           'se': round(results_didC.bse, 4),
                           't': round(results_didC.tvalues, 4),
                           'pval': round(results_didC.pvalues, 4)})
print(f'table_didC: \n{table_didC}\n')


Script 13.4: Example-FD.py 


import wooldridge as woo 


import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import linearmodels as plm

crime2 = woo.dataWoo('crime2')

# create time variable dummy by converting a Boolean variable to an integer:
crime2['t'] = (crime2['year'] == 87).astype(int)  # False=0, True=1

# create an index in this balanced data set by combining two arrays:
id_tmp = np.linspace(1, 46, num=46)
crime2['id'] = np.sort(np.concatenate([id_tmp, id_tmp]))

# manually calculate first differences per entity for crmrte and unem:
crime2['crmrte_diff1'] = \
    crime2.sort_values(['id', 'year']).groupby('id')['crmrte'].diff()
crime2['unem_diff1'] = \
    crime2.sort_values(['id', 'year']).groupby('id')['unem'].diff()
var_selection = ['id', 't', 'crimes', 'unem', 'crmrte_diff1', 'unem_diff1']
print(f'crime2[var_selection].head(): \n{crime2[var_selection].head()}\n')

# estimate FD model with statsmodels on differenced data:
reg_sm = smf.ols(formula='crmrte_diff1 ~ unem_diff1', data=crime2)
results_sm = reg_sm.fit()

# print results:
table_sm = pd.DataFrame({'b': round(results_sm.params, 4),
                         'se': round(results_sm.bse, 4),
                         't': round(results_sm.tvalues, 4),
                         'pval': round(results_sm.pvalues, 4)})
print(f'table_sm: \n{table_sm}\n')

# estimate FD model with linearmodels:
crime2 = crime2.set_index(['id', 'year'])
reg_plm = plm.FirstDifferenceOLS.from_formula(formula='crmrte ~ t + unem',
                                              data=crime2)
results_plm = reg_plm.fit()

# print results:
table_plm = pd.DataFrame({'b': round(results_plm.params, 4),
                          'se': round(results_plm.std_errors, 4),
                          't': round(results_plm.tstats, 4),
                          'pval': round(results_plm.pvalues, 4)})
print(f'table_plm: \n{table_plm}\n')


Script 13.5: Example-13-9.py
import wooldridge as woo
import numpy as np
import linearmodels as plm

crime4 = woo.dataWoo('crime4')
crime4 = crime4.set_index(['county', 'year'], drop=False)

# estimate FD model:
reg = plm.FirstDifferenceOLS.from_formula(
    formula='np.log(crmrte) ~ year + d83 + d84 + d85 + d86 + d87 +'
            'lprbarr + lprbconv + lprbpris + lavgsen + lpolpc',
    data=crime4)
results = reg.fit()
print(f'results: \n{results}\n')


14. Scripts Used in Chapter 14 


Script 14.1: Example-14-2.py
import wooldridge as woo
import pandas as pd
import linearmodels as plm

wagepan = woo.dataWoo('wagepan')
wagepan = wagepan.set_index(['nr', 'year'], drop=False)

# FE model estimation:
reg = plm.PanelOLS.from_formula(
    formula='lwage ~ married + union + C(year)*educ + EntityEffects',
    data=wagepan, drop_absorbed=True)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.std_errors, 4),
                      't': round(results.tstats, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Script 14.2: Example-14-4-1.py 
import wooldridge as woo 


wagepan = woo.dataWoo('wagepan')

# print relevant dimensions for panel:
N = wagepan.shape[0]
T = wagepan['year'].drop_duplicates().shape[0]
n = wagepan['nr'].drop_duplicates().shape[0]
print(f'N: {N}\n')
print(f'T: {T}\n')
print(f'n: {n}\n')

# check non-varying variables

# (I) across time and within individuals by calculating individual
# specific variances for each variable:
isv_nr = (wagepan.groupby('nr').var() == 0)  # True, if variance is zero
# choose variables where all grouped variances are zero:
noVar_nr = isv_nr.all(axis=0)  # which cols are completely True
print(f'isv_nr.columns[noVar_nr]: \n{isv_nr.columns[noVar_nr]}\n')

# (II) across individuals within one point in time for each variable:
isv_t = (wagepan.groupby('year').var() == 0)
noVar_t = isv_t.all(axis=0)
print(f'isv_t.columns[noVar_t]: \n{isv_t.columns[noVar_t]}\n')


Script 14.3: Example-14-4-2.py
import wooldridge as woo
import pandas as pd
import linearmodels as plm

wagepan = woo.dataWoo('wagepan')

# estimate different models:
wagepan = wagepan.set_index(['nr', 'year'], drop=False)

reg_ols = plm.PooledOLS.from_formula(
    formula='lwage ~ educ + black + hisp + exper + I(exper**2) +'
            'married + union + C(year)', data=wagepan)
results_ols = reg_ols.fit()

reg_re = plm.RandomEffects.from_formula(
    formula='lwage ~ educ + black + hisp + exper + I(exper**2) +'
            'married + union + C(year)', data=wagepan)
results_re = reg_re.fit()

reg_fe = plm.PanelOLS.from_formula(
    formula='lwage ~ I(exper**2) + married + union +'
            'C(year) + EntityEffects', data=wagepan)
results_fe = reg_fe.fit()

# print results:
theta_hat = results_re.theta.iloc[0, 0]
print(f'theta_hat: {theta_hat}\n')

table_ols = pd.DataFrame({'b': round(results_ols.params, 4),
                          'se': round(results_ols.std_errors, 4),
                          't': round(results_ols.tstats, 4),
                          'pval': round(results_ols.pvalues, 4)})
print(f'table_ols: \n{table_ols}\n')

table_re = pd.DataFrame({'b': round(results_re.params, 4),
                         'se': round(results_re.std_errors, 4),
                         't': round(results_re.tstats, 4),
                         'pval': round(results_re.pvalues, 4)})
print(f'table_re: \n{table_re}\n')

table_fe = pd.DataFrame({'b': round(results_fe.params, 4),
                         'se': round(results_fe.std_errors, 4),
                         't': round(results_fe.tstats, 4),
                         'pval': round(results_fe.pvalues, 4)})
print(f'table_fe: \n{table_fe}\n')


Script 14.4: Example-HausmTest.py
import wooldridge as woo
import numpy as np
import linearmodels as plm
import scipy.stats as stats

wagepan = woo.dataWoo('wagepan')
wagepan = wagepan.set_index(['nr', 'year'], drop=False)

# estimation of FE and RE:
reg_fe = plm.PanelOLS.from_formula(formula='lwage ~ I(exper**2) + married +'
                                           'union + C(year) + EntityEffects',
                                   data=wagepan)
results_fe = reg_fe.fit()
b_fe = results_fe.params
b_fe_cov = results_fe.cov

reg_re = plm.RandomEffects.from_formula(
    formula='lwage ~ educ + black + hisp + exper + I(exper**2)'
            '+ married + union + C(year)', data=wagepan)
results_re = reg_re.fit()
b_re = results_re.params
b_re_cov = results_re.cov

# Hausman test of FE vs. RE
# (I) find overlapping coefficients
# (converted to a list, because recent pandas versions reject set indexers):
common_coef = list(set(results_fe.params.index).intersection(results_re.params.index))

# (II) calculate differences between FE and RE:
b_diff = np.array(results_fe.params[common_coef] - results_re.params[common_coef])
df = len(b_diff)
b_diff.reshape((df, 1))
b_cov_diff = np.array(b_fe_cov.loc[common_coef, common_coef] -
                      b_re_cov.loc[common_coef, common_coef])
b_cov_diff.reshape((df, df))

# (III) calculate test statistic:
stat = abs(np.transpose(b_diff) @ np.linalg.inv(b_cov_diff) @ b_diff)
pval = 1 - stats.chi2.cdf(stat, df)

print(f'stat: {stat}\n')
print(f'pval: {pval}\n')
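
The three steps of the Hausman test in Script 14.4 can be traced by hand in a small case. The following is only a sketch under strong assumptions: it handles exactly two common coefficients, the function name hausman_two_coef is hypothetical, and the input numbers are made up for illustration rather than taken from the wagepan estimates. With two degrees of freedom the chi-squared p-value has the closed form exp(-stat / 2), so no external libraries are needed.

```python
import math


def hausman_two_coef(b_fe, b_re, cov_fe, cov_re):
    """Hausman statistic and p-value for exactly two common coefficients."""
    # (I) difference of the coefficient vectors:
    d0 = b_fe[0] - b_re[0]
    d1 = b_fe[1] - b_re[1]
    # (II) difference of the 2x2 covariance matrices:
    v00 = cov_fe[0][0] - cov_re[0][0]
    v01 = cov_fe[0][1] - cov_re[0][1]
    v10 = cov_fe[1][0] - cov_re[1][0]
    v11 = cov_fe[1][1] - cov_re[1][1]
    det = v00 * v11 - v01 * v10
    # (III) quadratic form d' V^(-1) d using the closed-form 2x2 inverse:
    quad = (d0 * (v11 * d0 - v01 * d1) + d1 * (-v10 * d0 + v00 * d1)) / det
    stat = abs(quad)
    # chi-squared survival function with 2 degrees of freedom: exp(-x / 2)
    pval = math.exp(-stat / 2)
    return stat, pval


# hypothetical estimates (not taken from the wagepan data):
b_fe = [0.05, 0.08]
b_re = [0.06, 0.10]
cov_fe = [[0.0004, 0.0], [0.0, 0.0004]]
cov_re = [[0.0003, 0.0], [0.0, 0.0003]]

stat, pval = hausman_two_coef(b_fe, b_re, cov_fe, cov_re)
print(f'stat: {stat}\n')
print(f'pval: {pval}\n')
```

For more than two coefficients the matrix version in Script 14.4 (np.linalg.inv and stats.chi2.cdf) is the natural choice; the closed forms here only serve to make each step visible.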


Script 14.5: Example-Dummy-CRE-1.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf
import linearmodels as plm

wagepan = woo.dataWoo('wagepan')
wagepan['t'] = wagepan['year']
wagepan['entity'] = wagepan['nr']
wagepan = wagepan.set_index(['nr'])

# include group specific means:
wagepan['married_b'] = wagepan.groupby('nr').mean()['married']
wagepan['union_b'] = wagepan.groupby('nr').mean()['union']
wagepan = wagepan.set_index(['year'], append=True)

# estimate FE parameters in 3 different ways:
reg_we = plm.PanelOLS.from_formula(
    formula='lwage ~ married + union + C(t)*educ + EntityEffects',
    drop_absorbed=True, data=wagepan)
results_we = reg_we.fit()

reg_dum = smf.ols(
    formula='lwage ~ married + union + C(t)*educ + C(entity)',
    data=wagepan)
results_dum = reg_dum.fit()

reg_cre = plm.RandomEffects.from_formula(
    formula='lwage ~ married + union + C(t)*educ + married_b + union_b',
    data=wagepan)
results_cre = reg_cre.fit()

# compare to RE estimates:
reg_re = plm.RandomEffects.from_formula(
    formula='lwage ~ married + union + C(t)*educ',
    data=wagepan)
results_re = reg_re.fit()

var_selection = ['married', 'union', 'C(t)[T.1982]:educ']

# print results:
table = pd.DataFrame({'b_we': round(results_we.params[var_selection], 4),
                      'b_dum': round(results_dum.params[var_selection], 4),
                      'b_cre': round(results_cre.params[var_selection], 4),
                      'b_re': round(results_re.params[var_selection], 4)})
print(f'table: \n{table}\n')


Script 14.6: Example-CRE-test-RE.py
import wooldridge as woo
import linearmodels as plm

wagepan = woo.dataWoo('wagepan')
wagepan['t'] = wagepan['year']
wagepan['entity'] = wagepan['nr']
wagepan = wagepan.set_index(['nr'])

# include group specific means:
wagepan['married_b'] = wagepan.groupby('nr').mean()['married']
wagepan['union_b'] = wagepan.groupby('nr').mean()['union']
wagepan = wagepan.set_index(['year'], append=True)

# estimate CRE:
reg_cre = plm.RandomEffects.from_formula(
    formula='lwage ~ married + union + C(t)*educ + married_b + union_b',
    data=wagepan)
results_cre = reg_cre.fit()

# RE test as a Wald test on the CRE specific coefficients:
wtest = results_cre.wald_test(formula='married_b = union_b = 0')
print(f'wtest: \n{wtest}\n')


Script 14.7: Example-CRE-2.py
import wooldridge as woo
import pandas as pd
import linearmodels as plm

wagepan = woo.dataWoo('wagepan')
wagepan['t'] = wagepan['year']
wagepan['entity'] = wagepan['nr']
wagepan = wagepan.set_index(['nr'])

# include group specific means:
wagepan['married_b'] = wagepan.groupby('nr').mean()['married']
wagepan['union_b'] = wagepan.groupby('nr').mean()['union']
wagepan = wagepan.set_index(['year'], append=True)

# estimate CRE parameters:
reg = plm.RandomEffects.from_formula(
    formula='lwage ~ married + union + educ +'
            'black + hisp + married_b + union_b',
    data=wagepan)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.std_errors, 4),
                      't': round(results.tstats, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')


Script 14.8: Example-13-9-ClSE.py
import wooldridge as woo
import numpy as np
import pandas as pd
import linearmodels as plm

crime4 = woo.dataWoo('crime4')
crime4 = crime4.set_index(['county', 'year'], drop=False)

# estimate FD model:
reg = plm.FirstDifferenceOLS.from_formula(
    formula='np.log(crmrte) ~ year + d83 + d84 + d85 + d86 + d87 +'
            'lprbarr + lprbconv + lprbpris + lavgsen + lpolpc',
    data=crime4)

# standard SE:
results_default = reg.fit()

# "clustered" SE:
results_cluster = reg.fit(cov_type='clustered', cluster_entity=True,
                          debiased=False)

# "clustered" SE (small-sample correction):
results_css = reg.fit(cov_type='clustered', cluster_entity=True)

# print results:
table = pd.DataFrame({'b': round(results_default.params, 4),
                      'se_default': round(results_default.std_errors, 4),
                      'se_cluster': round(results_cluster.std_errors, 4),
                      'se_css': round(results_css.std_errors, 4)})
print(f'table: \n{table}\n')


15. Scripts Used in Chapter 15 


Script 15.1: Example-15-1.py
import wooldridge as woo
import numpy as np
import pandas as pd
import linearmodels.iv as iv
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# restrict to non-missing wage observations:
mroz = mroz.dropna(subset=['lwage'])

cov_yz = np.cov(mroz['lwage'], mroz['fatheduc'])[1, 0]
cov_xy = np.cov(mroz['educ'], mroz['lwage'])[1, 0]
cov_xz = np.cov(mroz['educ'], mroz['fatheduc'])[1, 0]
var_x = np.var(mroz['educ'], ddof=1)
x_bar = np.mean(mroz['educ'])
y_bar = np.mean(mroz['lwage'])

# OLS slope parameter manually:
b_ols_man = cov_xy / var_x
print(f'b_ols_man: {b_ols_man}\n')

# IV slope parameter manually:
b_iv_man = cov_yz / cov_xz
print(f'b_iv_man: {b_iv_man}\n')

# OLS automatically:
reg_ols = smf.ols(formula='np.log(wage) ~ educ', data=mroz)
results_ols = reg_ols.fit()

# print regression table:
table_ols = pd.DataFrame({'b': round(results_ols.params, 4),
                          'se': round(results_ols.bse, 4),
                          't': round(results_ols.tvalues, 4),
                          'pval': round(results_ols.pvalues, 4)})
print(f'table_ols: \n{table_ols}\n')

# IV automatically:
reg_iv = iv.IV2SLS.from_formula(formula='np.log(wage) ~ 1 + [educ ~ fatheduc]',
                                data=mroz)
results_iv = reg_iv.fit(cov_type='unadjusted', debiased=True)

# print regression table:
table_iv = pd.DataFrame({'b': round(results_iv.params, 4),
                         'se': round(results_iv.std_errors, 4),
                         't': round(results_iv.tstats, 4),
                         'pval': round(results_iv.pvalues, 4)})
print(f'table_iv: \n{table_iv}\n')
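
The manual slope formulas used in Script 15.1 (OLS as Cov(x, y) / Var(x) and IV as Cov(z, y) / Cov(z, x)) can be checked on a tiny constructed data set. This is a sketch with hypothetical helper names (sample_cov, ols_slope, iv_slope) and made-up numbers: the error u is built to be correlated with the regressor x but to have zero sample covariance with the instrument z, and the true slope is 0.5.

```python
def sample_cov(a, b):
    """Sample covariance with the n - 1 denominator."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (n - 1)


def ols_slope(y, x):
    # OLS slope: Cov(x, y) / Var(x)
    return sample_cov(x, y) / sample_cov(x, x)


def iv_slope(y, x, z):
    # IV slope: Cov(z, y) / Cov(z, x)
    return sample_cov(z, y) / sample_cov(z, x)


# constructed toy data (purely illustrative):
z = [1.0, 2.0, 3.0, 4.0]                 # instrument
u = [1.0, -1.0, -1.0, 1.0]               # error, sample covariance with z is 0
x = [zi + ui for zi, ui in zip(z, u)]    # endogenous regressor
y = [1.0 + 0.5 * xi + ui for xi, ui in zip(x, u)]

b_ols = ols_slope(y, x)
b_iv = iv_slope(y, x, z)
print(f'b_ols: {b_ols}\n')
print(f'b_iv: {b_iv}\n')
```

Because x contains u, the OLS slope is pushed away from 0.5, while the IV slope recovers it in this constructed sample.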


Script 15.2: Example-15-4.py
import wooldridge as woo
import numpy as np
import pandas as pd
import linearmodels.iv as iv
import statsmodels.formula.api as smf

card = woo.dataWoo('card')

# checking for relevance with reduced form:
reg_redf = smf.ols(
    formula='educ ~ nearc4 + exper + I(exper**2) + black + smsa +'
            'south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 +'
            'reg667 + reg668 + reg669', data=card)
results_redf = reg_redf.fit()

# print regression table:
table_redf = pd.DataFrame({'b': round(results_redf.params, 4),
                           'se': round(results_redf.bse, 4),
                           't': round(results_redf.tvalues, 4),
                           'pval': round(results_redf.pvalues, 4)})
print(f'table_redf: \n{table_redf}\n')

# OLS:
reg_ols = smf.ols(
    formula='np.log(wage) ~ educ + exper + I(exper**2) + black + smsa +'
            'south + smsa66 + reg662 + reg663 + reg664 + reg665 +'
            'reg666 + reg667 + reg668 + reg669', data=card)
results_ols = reg_ols.fit()

# print regression table:
table_ols = pd.DataFrame({'b': round(results_ols.params, 4),
                          'se': round(results_ols.bse, 4),
                          't': round(results_ols.tvalues, 4),
                          'pval': round(results_ols.pvalues, 4)})
print(f'table_ols: \n{table_ols}\n')

# IV automatically:
reg_iv = iv.IV2SLS.from_formula(
    formula='np.log(wage) ~ 1 + exper + I(exper**2) + black + smsa + '
            'south + smsa66 + reg662 + reg663 + reg664 + reg665 +'
            'reg666 + reg667 + reg668 + reg669 + [educ ~ nearc4]',
    data=card)
results_iv = reg_iv.fit(cov_type='unadjusted', debiased=True)

# print regression table:
table_iv = pd.DataFrame({'b': round(results_iv.params, 4),
                         'se': round(results_iv.std_errors, 4),
                         't': round(results_iv.tstats, 4),
                         'pval': round(results_iv.pvalues, 4)})
print(f'table_iv: \n{table_iv}\n')


Script 15.3: Example-15-5.py
import wooldridge as woo
import numpy as np
import pandas as pd
import linearmodels.iv as iv
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# restrict to non-missing wage observations:
mroz = mroz.dropna(subset=['lwage'])

# 1st stage (reduced form):
reg_redf = smf.ols(formula='educ ~ exper + I(exper**2) + motheduc + fatheduc',
                   data=mroz)
results_redf = reg_redf.fit()
mroz['educ_fitted'] = results_redf.fittedvalues

# print regression table:
table_redf = pd.DataFrame({'b': round(results_redf.params, 4),
                           'se': round(results_redf.bse, 4),
                           't': round(results_redf.tvalues, 4),
                           'pval': round(results_redf.pvalues, 4)})
print(f'table_redf: \n{table_redf}\n')

# 2nd stage:
reg_secstg = smf.ols(formula='np.log(wage) ~ educ_fitted + exper + I(exper**2)',
                     data=mroz)
results_secstg = reg_secstg.fit()

# print regression table:
table_secstg = pd.DataFrame({'b': round(results_secstg.params, 4),
                             'se': round(results_secstg.bse, 4),
                             't': round(results_secstg.tvalues, 4),
                             'pval': round(results_secstg.pvalues, 4)})
print(f'table_secstg: \n{table_secstg}\n')

# IV automatically:
reg_iv = iv.IV2SLS.from_formula(
    formula='np.log(wage) ~ 1 + exper + I(exper**2) +'
            '[educ ~ motheduc + fatheduc]',
    data=mroz)
results_iv = reg_iv.fit(cov_type='unadjusted', debiased=True)

# print regression table:
table_iv = pd.DataFrame({'b': round(results_iv.params, 4),
                         'se': round(results_iv.std_errors, 4),
                         't': round(results_iv.tstats, 4),
                         'pval': round(results_iv.pvalues, 4)})
print(f'table_iv: \n{table_iv}\n')


Script 15.4: Example-15-7.py
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# restrict to non-missing wage observations:
mroz = mroz.dropna(subset=['lwage'])

# 1st stage (reduced form):
reg_redf = smf.ols(formula='educ ~ exper + I(exper**2) + motheduc + fatheduc',
                   data=mroz)
results_redf = reg_redf.fit()
mroz['resid'] = results_redf.resid

# 2nd stage:
reg_secstg = smf.ols(formula='np.log(wage) ~ resid + educ + exper + I(exper**2)',
                     data=mroz)
results_secstg = reg_secstg.fit()

# print regression table:
table_secstg = pd.DataFrame({'b': round(results_secstg.params, 4),
                             'se': round(results_secstg.bse, 4),
                             't': round(results_secstg.tvalues, 4),
                             'pval': round(results_secstg.pvalues, 4)})
print(f'table_secstg: \n{table_secstg}\n')


Script 15.5: Example-15-8.py
import wooldridge as woo
import numpy as np
import pandas as pd
import linearmodels.iv as iv
import statsmodels.formula.api as smf
import scipy.stats as stats

mroz = woo.dataWoo('mroz')

# restrict to non-missing wage observations:
mroz = mroz.dropna(subset=['lwage'])

# IV regression:
reg_iv = iv.IV2SLS.from_formula(formula='np.log(wage) ~ 1 + exper + I(exper**2) +'
                                        '[educ ~ motheduc + fatheduc]', data=mroz)
results_iv = reg_iv.fit(cov_type='unadjusted', debiased=True)

# print regression table:
table_iv = pd.DataFrame({'b': round(results_iv.params, 4),
                         'se': round(results_iv.std_errors, 4),
                         't': round(results_iv.tstats, 4),
                         'pval': round(results_iv.pvalues, 4)})
print(f'table_iv: \n{table_iv}\n')

# auxiliary regression:
mroz['resid_iv'] = results_iv.resids
reg_aux = smf.ols(formula='resid_iv ~ exper + I(exper**2) + motheduc + fatheduc',
                  data=mroz)
results_aux = reg_aux.fit()

# calculations for test:
r2 = results_aux.rsquared
n = results_aux.nobs
teststat = n * r2
pval = 1 - stats.chi2.cdf(teststat, 1)

print(f'r2: {r2}\n')
print(f'n: {n}\n')
print(f'teststat: {teststat}\n')
print(f'pval: {pval}\n')


Script 15.6: Example-15-10.py
import wooldridge as woo
import pandas as pd
import linearmodels.iv as iv

jtrain = woo.dataWoo('jtrain')

# define panel data (for 1987 and 1988 only):
jtrain_87_88 = jtrain.loc[(jtrain['year'] == 1987) | (jtrain['year'] == 1988), :]
jtrain_87_88 = jtrain_87_88.set_index(['fcode', 'year'])

# manual computation of first differences within each firm:
jtrain_87_88['lscrap_diff1'] = \
    jtrain_87_88.sort_values(['fcode', 'year']).groupby('fcode')['lscrap'].diff()
jtrain_87_88['hrsemp_diff1'] = \
    jtrain_87_88.sort_values(['fcode', 'year']).groupby('fcode')['hrsemp'].diff()
jtrain_87_88['grant_diff1'] = \
    jtrain_87_88.sort_values(['fcode', 'year']).groupby('fcode')['grant'].diff()

# IV regression:
reg_iv = iv.IV2SLS.from_formula(
    formula='lscrap_diff1 ~ 1 + [hrsemp_diff1 ~ grant_diff1]',
    data=jtrain_87_88)
results_iv = reg_iv.fit(cov_type='unadjusted', debiased=True)

# print regression table:
table_iv = pd.DataFrame({'b': round(results_iv.params, 4),
                         'se': round(results_iv.std_errors, 4),
                         't': round(results_iv.tstats, 4),
                         'pval': round(results_iv.pvalues, 4)})
print(f'table_iv: \n{table_iv}\n')


16. Scripts Used in Chapter 16 


Script 16.1: Example-16-5-2SLS.py
import wooldridge as woo
import numpy as np
import pandas as pd
import linearmodels.iv as iv

mroz = woo.dataWoo('mroz')

# restrict to non-missing wage observations:
mroz = mroz.dropna(subset=['lwage'])

# 2SLS regressions:
reg_iv1 = iv.IV2SLS.from_formula(
    'hours ~ 1 + educ + age + kidslt6 + nwifeinc +'
    '[np.log(wage) ~ exper + I(exper**2)]', data=mroz)
results_iv1 = reg_iv1.fit(cov_type='unadjusted', debiased=True)

reg_iv2 = iv.IV2SLS.from_formula(
    'np.log(wage) ~ 1 + educ + exper + I(exper**2) +'
    '[hours ~ age + kidslt6 + nwifeinc]', data=mroz)
results_iv2 = reg_iv2.fit(cov_type='unadjusted', debiased=True)

# print results:
table_iv1 = pd.DataFrame({'b': round(results_iv1.params, 4),
                          'se': round(results_iv1.std_errors, 4),
                          't': round(results_iv1.tstats, 4),
                          'pval': round(results_iv1.pvalues, 4)})
print(f'table_iv1: \n{table_iv1}\n')

table_iv2 = pd.DataFrame({'b': round(results_iv2.params, 4),
                          'se': round(results_iv2.std_errors, 4),
                          't': round(results_iv2.tstats, 4),
                          'pval': round(results_iv2.pvalues, 4)})
print(f'table_iv2: \n{table_iv2}\n')

cor_u1u2 = np.corrcoef(results_iv1.resids, results_iv2.resids)[0, 1]
print(f'cor_u1u2: {cor_u1u2}\n')


Script 16.2: Example-16-5-3SLS.py
import wooldridge as woo
import numpy as np
import linearmodels.system as iv3

mroz = woo.dataWoo('mroz')

# restrict to non-missing wage observations:
mroz = mroz.dropna(subset=['lwage'])

# 3SLS regressions:
formula = {'eq1': 'hours ~ 1 + educ + age + kidslt6 + nwifeinc +'
                  '[np.log(wage) ~ exper + I(exper**2)]',
           'eq2': 'np.log(wage) ~ 1 + educ + exper + I(exper**2) +'
                  '[hours ~ age + kidslt6 + nwifeinc]'}
reg_3sls = iv3.IV3SLS.from_formula(formula, data=mroz)
results_3sls = reg_3sls.fit(cov_type='unadjusted', debiased=True)
print(f'results_3sls: \n{results_3sls}\n')


17. Scripts Used in Chapter 17 


Script 17.1: Example-17-1-1.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# estimate linear probability model:
reg_lin = smf.ols(formula='inlf ~ nwifeinc + educ + exper +'
                          'I(exper**2) + age + kidslt6 + kidsge6',
                  data=mroz)
results_lin = reg_lin.fit(cov_type='HC3')

# print regression table:
table = pd.DataFrame({'b': round(results_lin.params, 4),
                      'se': round(results_lin.bse, 4),
                      't': round(results_lin.tvalues, 4),
                      'pval': round(results_lin.pvalues, 4)})
print(f'table: \n{table}\n')


Script 17.2: Example-17-1-2.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# estimate linear probability model:
reg_lin = smf.ols(formula='inlf ~ nwifeinc + educ + exper +'
                          'I(exper**2) + age + kidslt6 + kidsge6',
                  data=mroz)
results_lin = reg_lin.fit(cov_type='HC3')

# predictions for two "extreme" women:
X_new = pd.DataFrame(
    {'nwifeinc': [100, 0], 'educ': [5, 17],
     'exper': [0, 30], 'age': [20, 52],
     'kidslt6': [2, 0], 'kidsge6': [0, 0]})
predictions = results_lin.predict(X_new)
print(f'predictions: \n{predictions}\n')


Script 17.3: Example-17-1-3.py
import wooldridge as woo
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# estimate logit model:
reg_logit = smf.logit(formula='inlf ~ nwifeinc + educ + exper +'
                              'I(exper**2) + age + kidslt6 + kidsge6',
                      data=mroz)

# disp = 0 avoids printing out information during the estimation:
results_logit = reg_logit.fit(disp=0)
print(f'results_logit.summary(): \n{results_logit.summary()}\n')

# log likelihood value:
print(f'results_logit.llf: {results_logit.llf}\n')

# McFadden's pseudo R2:
print(f'results_logit.prsquared: {results_logit.prsquared}\n')


Script 17.4: Example-17-1-4.py
import wooldridge as woo
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# estimate probit model:
reg_probit = smf.probit(formula='inlf ~ nwifeinc + educ + exper +'
                                'I(exper**2) + age + kidslt6 + kidsge6',
                        data=mroz)
results_probit = reg_probit.fit(disp=0)
print(f'results_probit.summary(): \n{results_probit.summary()}\n')

# log likelihood value:
print(f'results_probit.llf: {results_probit.llf}\n')

# McFadden's pseudo R2:
print(f'results_probit.prsquared: {results_probit.prsquared}\n')


Script 17.5: Example-17-1-5.py
import wooldridge as woo
import statsmodels.formula.api as smf
import scipy.stats as stats

mroz = woo.dataWoo('mroz')

# estimate probit model:
reg_probit = smf.probit(formula='inlf ~ nwifeinc + educ + exper +'
                                'I(exper**2) + age + kidslt6 + kidsge6',
                        data=mroz)
results_probit = reg_probit.fit(disp=0)

# test of overall significance (test statistic and pvalue):
llr1_manual = 2 * (results_probit.llf - results_probit.llnull)
print(f'llr1_manual: {llr1_manual}\n')
print(f'results_probit.llr: {results_probit.llr}\n')
print(f'results_probit.llr_pvalue: {results_probit.llr_pvalue}\n')

# automatic Wald test of H0 (experience and age are irrelevant):
hypotheses = ['exper=0', 'I(exper ** 2)=0', 'age=0']
waldstat = results_probit.wald_test(hypotheses)
teststat2_autom = waldstat.statistic
pval2_autom = waldstat.pvalue
print(f'teststat2_autom: {teststat2_autom}\n')
print(f'pval2_autom: {pval2_autom}\n')

# manual likelihood ratio statistic test
# of H0 (experience and age are irrelevant):
reg_probit_restr = smf.probit(formula='inlf ~ nwifeinc + educ +'
                                      'kidslt6 + kidsge6',
                              data=mroz)
results_probit_restr = reg_probit_restr.fit(disp=0)

llr2_manual = 2 * (results_probit.llf - results_probit_restr.llf)
pval2_manual = 1 - stats.chi2.cdf(llr2_manual, 3)
print(f'llr2_manual: {llr2_manual}\n')
print(f'pval2_manual: {pval2_manual}\n')


Script 17.6: Example-17-1-6.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# estimate models:
reg_lin = smf.ols(formula='inlf ~ nwifeinc + educ + exper +'
                          'I(exper**2) + age + kidslt6 + kidsge6',
                  data=mroz)
results_lin = reg_lin.fit(cov_type='HC3')

reg_logit = smf.logit(formula='inlf ~ nwifeinc + educ + exper +'
                              'I(exper**2) + age + kidslt6 + kidsge6',
                      data=mroz)
results_logit = reg_logit.fit(disp=0)

reg_probit = smf.probit(formula='inlf ~ nwifeinc + educ + exper +'
                                'I(exper**2) + age + kidslt6 + kidsge6',
                        data=mroz)
results_probit = reg_probit.fit(disp=0)

# predictions for two "extreme" women:
X_new = pd.DataFrame(
    {'nwifeinc': [100, 0], 'educ': [5, 17],
     'exper': [0, 30], 'age': [20, 52],
     'kidslt6': [2, 0], 'kidsge6': [0, 0]})
predictions_lin = results_lin.predict(X_new)
predictions_logit = results_logit.predict(X_new)
predictions_probit = results_probit.predict(X_new)

print(f'predictions_lin: \n{predictions_lin}\n')
print(f'predictions_logit: \n{predictions_logit}\n')
print(f'predictions_probit: \n{predictions_probit}\n')


Script 17.7: Binary-Predictions.py
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# set the random seed:
np.random.seed(1234567)

y = stats.binom.rvs(1, 0.5, size=100)
x = stats.norm.rvs(0, 1, size=100) + 2 * y
sim_data = pd.DataFrame({'y': y, 'x': x})

# estimation:
reg_lin = smf.ols(formula='y ~ x', data=sim_data)
results_lin = reg_lin.fit()
reg_logit = smf.logit(formula='y ~ x', data=sim_data)
results_logit = reg_logit.fit(disp=0)
reg_probit = smf.probit(formula='y ~ x', data=sim_data)
results_probit = reg_probit.fit(disp=0)

# prediction for regular grid of x values:
X_new = pd.DataFrame({'x': np.linspace(min(x), max(x), 50)})
predictions_lin = results_lin.predict(X_new)
predictions_logit = results_logit.predict(X_new)
predictions_probit = results_probit.predict(X_new)

# scatter plot and fitted values:
plt.plot(x, y, color='grey', marker='o', linestyle='')
plt.plot(X_new['x'], predictions_lin,
         color='black', linestyle='-.', label='linear')
plt.plot(X_new['x'], predictions_logit,
         color='black', linestyle='-', linewidth=0.5, label='logit')
plt.plot(X_new['x'], predictions_probit,
         color='black', linestyle='--', label='probit')
plt.ylabel('y')
plt.xlabel('x')
plt.legend()
plt.savefig('PyGraphs/Binary-Predictions.pdf')


Script 17.8: Binary-Margeff.py
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import scipy.stats as stats

# set the random seed:
np.random.seed(1234567)

y = stats.binom.rvs(1, 0.5, size=100)
x = stats.norm.rvs(0, 1, size=100) + 2 * y
sim_data = pd.DataFrame({'y': y, 'x': x})

# estimation:
reg_lin = smf.ols(formula='y ~ x', data=sim_data)
results_lin = reg_lin.fit()
reg_logit = smf.logit(formula='y ~ x', data=sim_data)
results_logit = reg_logit.fit(disp=0)
reg_probit = smf.probit(formula='y ~ x', data=sim_data)
results_probit = reg_probit.fit(disp=0)

# calculate partial effects:
PE_lin = np.repeat(results_lin.params['x'], 100)

xb_logit = results_logit.fittedvalues
factor_logit = stats.logistic.pdf(xb_logit)
PE_logit = results_logit.params['x'] * factor_logit

xb_probit = results_probit.fittedvalues
factor_probit = stats.norm.pdf(xb_probit)
PE_probit = results_probit.params['x'] * factor_probit

# plot partial effects:
plt.plot(x, PE_lin, color='black',
         marker='o', linestyle='', label='linear')
plt.plot(x, PE_logit, color='black',
         marker='+', linestyle='', label='logit')
plt.plot(x, PE_probit, color='black',
         marker='*', linestyle='', label='probit')
plt.ylabel('partial effects')
plt.xlabel('x')
plt.legend()
plt.savefig('PyGraphs/Binary-margeff.pdf')
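
The partial effects in Script 17.8 are the slope coefficient multiplied by the density of the link distribution evaluated at the linear index. The sketch below spells out both densities with the standard library only; the helper names (logistic_pdf, normal_pdf, partial_effect) and the coefficient value 1.2 are hypothetical and only illustrate how the scale factor shrinks as the index moves away from zero.

```python
import math


def logistic_pdf(z):
    # density of the standard logistic distribution (symmetric around 0);
    # using -abs(z) avoids overflow for large |z|
    ez = math.exp(-abs(z))
    return ez / (1.0 + ez) ** 2


def normal_pdf(z):
    # density of the standard normal distribution
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def partial_effect(beta_x, index, pdf):
    # partial effect of x on P(y = 1 | x): beta_x * g(index)
    return beta_x * pdf(index)


beta_x = 1.2  # hypothetical slope coefficient
for index in [-2.0, 0.0, 2.0]:
    pe_logit = partial_effect(beta_x, index, logistic_pdf)
    pe_probit = partial_effect(beta_x, index, normal_pdf)
    print(f'index: {index}, PE logit: {pe_logit}, PE probit: {pe_probit}')
```

Both densities peak at an index of zero (0.25 for the logistic, about 0.399 for the normal), which is why the plotted partial effects are largest in the middle of the x range and flatten out in the tails.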


Script 17.9: Example-17-1-7.py
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import scipy.stats as stats

mroz = woo.dataWoo('mroz')

# estimate models:
reg_lin = smf.ols(formula='inlf ~ nwifeinc + educ + exper + I(exper**2) +'
                          'age + kidslt6 + kidsge6', data=mroz)
results_lin = reg_lin.fit(cov_type='HC3')

reg_logit = smf.logit(formula='inlf ~ nwifeinc + educ + exper + I(exper**2) +'
                              'age + kidslt6 + kidsge6', data=mroz)
results_logit = reg_logit.fit(disp=0)

reg_probit = smf.probit(formula='inlf ~ nwifeinc + educ + exper + I(exper**2) +'
                                'age + kidslt6 + kidsge6', data=mroz)
results_probit = reg_probit.fit(disp=0)

# manual average partial effects:
APE_lin = np.array(results_lin.params)

xb_logit = results_logit.fittedvalues
factor_logit = np.mean(stats.logistic.pdf(xb_logit))
APE_logit_manual = results_logit.params * factor_logit

xb_probit = results_probit.fittedvalues
factor_probit = np.mean(stats.norm.pdf(xb_probit))
APE_probit_manual = results_probit.params * factor_probit

table_manual = pd.DataFrame({'APE_lin': np.round(APE_lin, 4),
                             'APE_logit_manual': np.round(APE_logit_manual, 4),
                             'APE_probit_manual': np.round(APE_probit_manual, 4)})
print(f'table_manual: \n{table_manual}\n')

# automatic average partial effects:
coef_names = np.array(results_lin.model.exog_names)
coef_names = np.delete(coef_names, 0)  # drop Intercept
APE_logit_autom = results_logit.get_margeff().margeff
APE_probit_autom = results_probit.get_margeff().margeff

table_auto = pd.DataFrame({'coef_names': coef_names,
                           'APE_logit_autom': np.round(APE_logit_autom, 4),
                           'APE_probit_autom': np.round(APE_probit_autom, 4)})
print(f'table_auto: \n{table_auto}\n')


Script 17.10: Example-17-3.py
import wooldridge as woo
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

crime1 = woo.dataWoo('crime1')

# estimate linear model:
reg_lin = smf.ols(formula='narr86 ~ pcnv + avgsen + tottime + ptime86 +'
                          'qemp86 + inc86 + black + hispan + born60',
                  data=crime1)
results_lin = reg_lin.fit()

# print regression table:
table_lin = pd.DataFrame({'b': round(results_lin.params, 4),
                          'se': round(results_lin.bse, 4),
                          't': round(results_lin.tvalues, 4),
                          'pval': round(results_lin.pvalues, 4)})
print(f'table_lin: \n{table_lin}\n')

# estimate Poisson model:
reg_poisson = smf.poisson(formula='narr86 ~ pcnv + avgsen + tottime +'
                                  'ptime86 + qemp86 + inc86 + black +'
                                  'hispan + born60',
                          data=crime1)
results_poisson = reg_poisson.fit(disp=0)

# print regression table:
table_poisson = pd.DataFrame({'b': round(results_poisson.params, 4),
                              'se': round(results_poisson.bse, 4),
                              't': round(results_poisson.tvalues, 4),
                              'pval': round(results_poisson.pvalues, 4)})
print(f'table_poisson: \n{table_poisson}\n')

# estimate Quasi-Poisson model:
reg_qpoisson = smf.glm(formula='narr86 ~ pcnv + avgsen + tottime + ptime86 +'
                               'qemp86 + inc86 + black + hispan + born60',
                       family=sm.families.Poisson(),
                       data=crime1)
# the argument scale controls for the dispersion in exponential dispersion models,
# see the module documentation for more details:
results_qpoisson = reg_qpoisson.fit(scale='X2', disp=0)

# print regression table:
table_qpoisson = pd.DataFrame({'b': round(results_qpoisson.params, 4),
                               'se': round(results_qpoisson.bse, 4),
                               't': round(results_qpoisson.tvalues, 4),
                               'pval': round(results_qpoisson.pvalues, 4)})
print(f'table_qpoisson: \n{table_qpoisson}\n')


Script 17.11: Tobit-CondMean.py
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# set the random seed:
np.random.seed(1234567)

x = np.sort(stats.norm.rvs(0, 1, size=100) + 4)
xb = -4 + 1 * x
y_star = xb + stats.norm.rvs(0, 1, size=100)
y = np.copy(y_star)
y[y_star < 0] = 0

# conditional means:
Eystar = xb
Ey = stats.norm.cdf(xb / 1) * xb + 1 * stats.norm.pdf(xb / 1)

# plot data and conditional means:
plt.axhline(y=0, linewidth=0.5,
            linestyle='-', color='grey')
plt.plot(x, y_star, color='black',
         marker='x', linestyle='', label='y*')
plt.plot(x, y, color='black', marker='+',
         linestyle='', label='y')
plt.plot(x, Eystar, color='black', marker='',
         linestyle='-', label='E(y*)')
plt.plot(x, Ey, color='black', marker='',
         linestyle='--', label='E(y)')
plt.ylabel('y')
plt.xlabel('x')
plt.legend()
plt.savefig('PyGraphs/Tobit-CondMean.pdf')
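
The conditional mean of the censored outcome used in Script 17.11 is E(y|x) = Phi(xb / sigma) * xb + sigma * phi(xb / sigma). The following sketch evaluates it with the standard library only; the helper names (normal_pdf, normal_cdf, tobit_cond_mean) are hypothetical, and sigma = 1 matches the simulation above.

```python
import math


def normal_pdf(z):
    # density of the standard normal distribution
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    # cumulative distribution function of the standard normal distribution
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_cond_mean(xb, sigma=1.0):
    # E(y|x) for y = max(0, y*) with y* = xb + u and u ~ N(0, sigma^2)
    z = xb / sigma
    return normal_cdf(z) * xb + sigma * normal_pdf(z)


for xb in [-4.0, -1.0, 0.0, 1.0, 5.0]:
    print(f'xb: {xb}, E(y*): {xb}, E(y): {tobit_cond_mean(xb)}')
```

The values stay strictly positive for very negative xb and approach xb itself for large positive xb, which is the shape of the dashed E(y) curve relative to the straight E(y*) line in the plot.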


Script 17.12: Example-17-2.py
import wooldridge as woo
import numpy as np
import patsy as pt
import scipy.stats as stats
import statsmodels.formula.api as smf
import statsmodels.base.model as smclass

mroz = woo.dataWoo('mroz')
y, X = pt.dmatrices('hours ~ nwifeinc + educ + exper +'
                    'I(exper**2) + age + kidslt6 + kidsge6',
                    data=mroz, return_type='dataframe')

# generate starting solution:
reg_ols = smf.ols(formula='hours ~ nwifeinc + educ + exper + I(exper**2) +'
                          'age + kidslt6 + kidsge6', data=mroz)
results_ols = reg_ols.fit()
sigma_start = np.log(sum(results_ols.resid ** 2) / len(results_ols.resid))
params_start = np.concatenate((np.array(results_ols.params), sigma_start),
                              axis=None)


# extend statsmodels class by defining nloglikeobs:
class Tobit(smclass.GenericLikelihoodModel):
    # define a function that returns the negative log likelihood per observation
    # for a set of parameters that is provided by the argument "params":
    def nloglikeobs(self, params):
        # objects in "self" are defined in the parent class:
        X = self.exog
        y = self.endog
        p = X.shape[1]
        # for details on the implementation see Wooldridge (2019), formula 17.22:
        beta = params[0:p]
        sigma = np.exp(params[p])
        y_hat = np.dot(X, beta)
        y_eq = (y == 0)
        y_g = (y > 0)
        ll = np.empty(len(y))
        ll[y_eq] = np.log(stats.norm.cdf(-y_hat[y_eq] / sigma))
        ll[y_g] = np.log(stats.norm.pdf((y - y_hat)[y_g] / sigma)) - np.log(sigma)
        # return an array of negative log likelihoods for each observation:
        return -ll


# results of MLE:
reg_tobit = Tobit(endog=y, exog=X)
results_tobit = reg_tobit.fit(start_params=params_start, maxiter=10000, disp=0)
print(f'results_tobit.summary(): \n{results_tobit.summary()}\n')


Script 17.13: Example-17-4.py
import wooldridge as woo
import numpy as np
import patsy as pt
import scipy.stats as stats
import statsmodels.formula.api as smf
import statsmodels.base.model as smclass

recid = woo.dataWoo('recid')

# define dummy for censored observations:
censored = recid['cens'] != 0
y, X = pt.dmatrices('ldurat ~ workprg + priors + tserved + felon +'
                    'alcohol + drugs + black + married + educ + age',
                    data=recid, return_type='dataframe')

# generate starting solution:
reg_ols = smf.ols(formula='ldurat ~ workprg + priors + tserved + felon +'
                          'alcohol + drugs + black + married + educ + age',
                  data=recid)
results_ols = reg_ols.fit()
sigma_start = np.log(sum(results_ols.resid ** 2) / len(results_ols.resid))
params_start = np.concatenate((np.array(results_ols.params), sigma_start),
                              axis=None)


# extend statsmodels class by defining nloglikeobs:
class CensReg(smclass.GenericLikelihoodModel):
    def __init__(self, endog, cens, exog):
        self.cens = cens
        super(smclass.GenericLikelihoodModel, self).__init__(endog, exog,
                                                             missing='none')

    def nloglikeobs(self, params):
        X = self.exog
        y = self.endog
        cens = self.cens
        p = X.shape[1]
        beta = params[0:p]
        sigma = np.exp(params[p])
        y_hat = np.dot(X, beta)
        ll = np.empty(len(y))
        # uncensored:
        ll[~cens] = np.log(stats.norm.pdf((y - y_hat)[~cens] /
                                          sigma)) - np.log(sigma)
        # censored:
        ll[cens] = np.log(stats.norm.cdf(-(y - y_hat)[cens] / sigma))
        return -ll


# results of MLE:
reg_censReg = CensReg(endog=y, exog=X, cens=censored)
results_censReg = reg_censReg.fit(start_params=params_start,
                                  maxiter=10000, method='BFGS', disp=0)
print(f'results_censReg.summary(): \n{results_censReg.summary()}\n')
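
The two branches of nloglikeobs in Script 17.13 can be checked by hand for a single observation. The following is a minimal sketch that uses only the standard library; the helper name cens_loglik_obs and all numbers are made up for illustration and are not part of the script above.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def cens_loglik_obs(y, y_hat, sigma, censored):
    """Log-likelihood contribution of one observation (hypothetical helper)."""
    z = (y - y_hat) / sigma
    if censored:
        # censored: we only know that the latent outcome exceeds y
        return math.log(STD_NORMAL.cdf(-z))
    # uncensored: normal density of the standardized residual, scaled by sigma
    return math.log(STD_NORMAL.pdf(z)) - math.log(sigma)


# made-up values for illustration:
ll_uncens = cens_loglik_obs(y=3.0, y_hat=3.0, sigma=2.0, censored=False)
ll_cens = cens_loglik_obs(y=3.0, y_hat=3.0, sigma=2.0, censored=True)
print(f'll_uncens: {ll_uncens}')
print(f'll_cens: {ll_cens}')
```

With a residual of zero, the censored contribution reduces to log(0.5), which makes the formula easy to verify.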


Script 17.14: TruncReg-Simulation.py 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import scipy.stats as stats

# set the random seed:
np.random.seed(1234567)

x = np.sort(stats.norm.rvs(0, 1, size=100) + 4)
y = -4 + 1 * x + stats.norm.rvs(0, 1, size=100)

# complete observations and observed sample:
compl = pd.DataFrame({'x': x, 'y': y})
sample = compl.loc[y > 0]

# predictions OLS:
reg_ols = smf.ols(formula='y ~ x', data=sample)
results_ols = reg_ols.fit()
yhat_ols = results_ols.fittedvalues

# predictions truncated regression:
reg_tr = smf.ols(formula='y ~ x', data=compl)
results_tr = reg_tr.fit()
yhat_tr = results_tr.fittedvalues

# plot data and conditional means:
plt.axhline(y=0, linewidth=0.5, linestyle='-', color='grey')
plt.plot(compl['x'], compl['y'], color='black',
         marker='o', fillstyle='none', linestyle='', label='all data')
plt.plot(sample['x'], sample['y'], color='black',
         marker='o', fillstyle='full', linestyle='', label='sample data')
plt.plot(sample['x'], yhat_ols, color='black',
         marker='', linestyle='--', label='OLS fit')
plt.plot(compl['x'], yhat_tr, color='black',
         marker='', linestyle='-', label='Trunc. Reg. fit')
plt.ylabel('y')
plt.xlabel('x')
plt.legend()
plt.savefig('PyGraphs/TruncReg-Simulation.pdf')


Script 17.15: Example-17-5.py
import wooldridge as woo
import statsmodels.formula.api as smf
import scipy.stats as stats

mroz = woo.dataWoo('mroz')

# step 1 (use all n observations to estimate a probit model of s_i on z_i):
reg_probit = smf.probit(formula='inlf ~ educ + exper + I(exper**2) +'
                                'nwifeinc + age + kidslt6 + kidsge6',
                        data=mroz)
results_probit = reg_probit.fit(disp=0)
pred_inlf = results_probit.fittedvalues
mroz['inv_mills'] = stats.norm.pdf(pred_inlf) / stats.norm.cdf(pred_inlf)

# step 2 (regress y_i on x_i and inv_mills in sample selection):
reg_heckit = smf.ols(formula='lwage ~ educ + exper + I(exper**2) + inv_mills',
                     subset=(mroz['inlf'] == 1), data=mroz)
results_heckit = reg_heckit.fit()

# print results:
print(f'results_heckit.summary(): \n{results_heckit.summary()}\n')
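
Step 1 of Script 17.15 turns the probit index into the inverse Mills ratio, the standard normal density divided by the standard normal cdf. The sketch below evaluates this ratio for a few made-up index values with the standard library only; the helper name inverse_mills is hypothetical and the values are not taken from the mroz data.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def inverse_mills(index):
    """Inverse Mills ratio pdf(index) / cdf(index) (hypothetical helper)."""
    return STD_NORMAL.pdf(index) / STD_NORMAL.cdf(index)


# made-up probit index values, from unlikely to likely selection:
for idx in (-2.0, 0.0, 2.0):
    print(f'index {idx}: inverse Mills ratio {inverse_mills(idx)}')
```

The ratio falls as the index rises, so observations that are very likely to be selected receive only a small correction term in step 2.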
Script 18.1: Example-18-1.py
import wooldridge as woo
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm

hseinv = woo.dataWoo('hseinv')

# add lags and detrend:
hseinv['linvpc_det'] = sm.tsa.tsatools.detrend(hseinv['linvpc'])
hseinv['gprice_lag1'] = hseinv['gprice'].shift(1)
hseinv['linvpc_det_lag1'] = hseinv['linvpc_det'].shift(1)

# Koyck geometric d.l.:
reg_koyck = smf.ols(formula='linvpc_det ~ gprice + linvpc_det_lag1',
                    data=hseinv)
results_koyck = reg_koyck.fit()

# print regression table:
table_koyck = pd.DataFrame({'b': round(results_koyck.params, 4),
                            'se': round(results_koyck.bse, 4),
                            't': round(results_koyck.tvalues, 4),
                            'pval': round(results_koyck.pvalues, 4)})
print(f'table_koyck: \n{table_koyck}\n')

# rational d.l.:
reg_rational = smf.ols(formula='linvpc_det ~ gprice + linvpc_det_lag1 +'
                               'gprice_lag1',
                       data=hseinv)
results_rational = reg_rational.fit()

# print regression table:
table_rational = pd.DataFrame({'b': round(results_rational.params, 4),
                               'se': round(results_rational.bse, 4),
                               't': round(results_rational.tvalues, 4),
                               'pval': round(results_rational.pvalues, 4)})
print(f'table_rational: \n{table_rational}\n')

# LRP:
lrp_koyck = results_koyck.params['gprice'] / (
        1 - results_koyck.params['linvpc_det_lag1'])
print(f'lrp_koyck: {lrp_koyck}\n')

lrp_rational = (results_rational.params['gprice'] +
                results_rational.params['gprice_lag1']) / (
        1 - results_rational.params['linvpc_det_lag1'])
print(f'lrp_rational: {lrp_rational}\n')
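
Both long-run propensities in Script 18.1 follow the same pattern: the sum of the coefficients on gprice and its lags, divided by one minus the coefficient on the lagged dependent variable. A minimal sketch of this arithmetic, with a hypothetical helper long_run_propensity and made-up coefficients (not the hseinv estimates):

```python
def long_run_propensity(impact_coefs, lag_dep_coef):
    """LRP = (sum of coefficients on x and its lags) / (1 - coef. on lagged y)."""
    return sum(impact_coefs) / (1 - lag_dep_coef)


# made-up coefficients for illustration:
lrp_koyck_demo = long_run_propensity([3.0], 0.5)
lrp_rational_demo = long_run_propensity([3.0, -2.0], 0.5)
print(f'lrp_koyck_demo: {lrp_koyck_demo}')
print(f'lrp_rational_demo: {lrp_rational_demo}')
```

The Koyck model is the special case with a single coefficient in the numerator.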


Script 18.2: Example-18-4.py 
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.api as sm

inven = woo.dataWoo('inven')
inven['lgdp'] = np.log(inven['gdp'])

# automated ADF:
res_ADF_aut = sm.tsa.stattools.adfuller(inven['lgdp'], maxlag=1, autolag=None,
                                        regression='ct', regresults=True)
ADF_stat_aut = res_ADF_aut[0]
ADF_pval_aut = res_ADF_aut[1]
table = pd.DataFrame({'names': res_ADF_aut[3].resols.model.exog_names,
                      'b': np.round(res_ADF_aut[3].resols.params, 4),
                      'se': np.round(res_ADF_aut[3].resols.bse, 4),
                      't': np.round(res_ADF_aut[3].resols.tvalues, 4),
                      'pval': np.round(res_ADF_aut[3].resols.pvalues, 4)})
print(f'table: \n{table}\n')
print(f'ADF_stat_aut: {ADF_stat_aut}\n')
print(f'ADF_pval_aut: {ADF_pval_aut}\n')
Script 18.3: Simulate-Spurious-Regression-1.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import scipy.stats as stats

# set the random seed:
np.random.seed(123456)

# i.i.d. N(0,1) innovations:
n = 51
e = stats.norm.rvs(0, 1, size=n)
e[0] = 0
a = stats.norm.rvs(0, 1, size=n)
a[0] = 0

# independent random walks:
x = np.cumsum(a)
y = np.cumsum(e)
sim_data = pd.DataFrame({'y': y, 'x': x})

# regression:
reg = smf.ols(formula='y ~ x', data=sim_data)
results = reg.fit()

# print regression table:
table = pd.DataFrame({'b': round(results.params, 4),
                      'se': round(results.bse, 4),
                      't': round(results.tvalues, 4),
                      'pval': round(results.pvalues, 4)})
print(f'table: \n{table}\n')

# graph:
plt.plot(x, color='black', linestyle='-', label='x')
plt.plot(y, color='black', linestyle='--', label='y')
plt.ylabel('x,y')
plt.legend()
plt.savefig('PyGraphs/Simulate-Spurious-Regression-1.pdf')


Script 18.4: Simulate-Spurious-Regression-2.py
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import scipy.stats as stats

# set the random seed:
np.random.seed(123456)

pvals = np.empty(10000)

# repeat r times:
for i in range(10000):
    # i.i.d. N(0,1) innovations:
    n = 51
    e = stats.norm.rvs(0, 1, size=n)
    e[0] = 0
    a = stats.norm.rvs(0, 1, size=n)
    a[0] = 0

    # independent random walks:
    x = np.cumsum(a)
    y = np.cumsum(e)
    sim_data = pd.DataFrame({'y': y, 'x': x})

    # regression:
    reg = smf.ols(formula='y ~ x', data=sim_data)
    results = reg.fit()
    pvals[i] = results.pvalues['x']

# how often is p<=5%:
count_pval_smaller = np.count_nonzero(pvals <= 0.05)  # counts True elements
print(f'count_pval_smaller: {count_pval_smaller}\n')

# how often is p>5%:
count_pval_greater = np.count_nonzero(pvals > 0.05)
print(f'count_pval_greater: {count_pval_greater}\n')
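
The same experiment can be sketched without statsmodels to make the mechanics of the t test explicit. The version below is an illustration under stated assumptions, not a replacement for Script 18.4: it uses the standard library only, fewer replications (1000), a different random number generator, and the approximate two-sided 5% critical value 2.0096 of the t distribution with 49 degrees of freedom instead of exact p values. The helper names slope_t_stat and random_walk are hypothetical.

```python
import math
import random


def slope_t_stat(x, y):
    """t statistic of the slope in a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(ssr / (n - 2) / sxx)
    return b1 / se_b1


def random_walk(n, rng):
    """Random walk of length n starting at 0 with N(0,1) innovations."""
    walk = [0.0]
    for _ in range(n - 1):
        walk.append(walk[-1] + rng.gauss(0, 1))
    return walk


rng = random.Random(123456)
n, reps = 51, 1000
crit = 2.0096  # approx. 97.5% quantile of the t distribution with 49 d.f.

rejections = 0
for _ in range(reps):
    x = random_walk(n, rng)
    y = random_walk(n, rng)
    if abs(slope_t_stat(x, y)) > crit:
        rejections += 1

share_rejected = rejections / reps
print(f'share_rejected: {share_rejected}')
```

If the test behaved as intended for independent series, the share of rejections would be close to 5%; with two random walks it turns out far larger.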


Script 18.5: Example-18-8.py
import wooldridge as woo
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

phillips = woo.dataWoo('phillips')

# define yearly time series beginning in 1948:
date_range = pd.date_range(start='1948', periods=len(phillips), freq='Y')
phillips.index = date_range.year

# estimate models:
yt96 = (phillips['year'] <= 1996)
reg_1 = smf.ols(formula='unem ~ unem_1', data=phillips, subset=yt96)
results_1 = reg_1.fit()
reg_2 = smf.ols(formula='unem ~ unem_1 + inf_1', data=phillips, subset=yt96)
results_2 = reg_2.fit()

# predictions for 1997-2003 including 95% forecast intervals:
yf97 = (phillips['year'] > 1996)
pred_1 = results_1.get_prediction(phillips[yf97])
pred_1_FI = pred_1.summary_frame(
    alpha=0.05)[['mean', 'obs_ci_lower', 'obs_ci_upper']]
pred_1_FI.index = date_range.year[yf97]
print(f'pred_1_FI: \n{pred_1_FI}\n')

pred_2 = results_2.get_prediction(phillips[yf97])
pred_2_FI = pred_2.summary_frame(
    alpha=0.05)[['mean', 'obs_ci_lower', 'obs_ci_upper']]
pred_2_FI.index = date_range.year[yf97]
print(f'pred_2_FI: \n{pred_2_FI}\n')

# forecast errors:
e1 = phillips[yf97]['unem'] - pred_1_FI['mean']
e2 = phillips[yf97]['unem'] - pred_2_FI['mean']

# RMSE and MAE:
rmse1 = np.sqrt(np.mean(e1 ** 2))
print(f'rmse1: {rmse1}\n')
rmse2 = np.sqrt(np.mean(e2 ** 2))
print(f'rmse2: {rmse2}\n')
mae1 = np.mean(abs(e1))
print(f'mae1: {mae1}\n')
mae2 = np.mean(abs(e2))
print(f'mae2: {mae2}\n')

# graph:
plt.plot(phillips[yf97]['unem'], color='black', marker='', label='unem')
plt.plot(pred_1_FI['mean'], color='black',
         marker='', linestyle='--', label='forecast without inflation')
plt.plot(pred_2_FI['mean'], color='black',
         marker='', linestyle='-.', label='forecast with inflation')
plt.ylabel('unemployment')
plt.xlabel('time')
plt.legend()
plt.savefig('PyGraphs/Example-18-8.pdf')
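
The two accuracy measures used in Script 18.5 are easy to verify by hand. The sketch below computes the root mean squared error and the mean absolute error for a short made-up series (not the phillips data); the helper names rmse and mae are hypothetical.

```python
import math


def rmse(errors):
    """Root mean squared error of a sequence of forecast errors."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))


def mae(errors):
    """Mean absolute error of a sequence of forecast errors."""
    return sum(abs(e) for e in errors) / len(errors)


# made-up actual values and forecasts for illustration:
actual = [5.0, 4.0, 6.0, 5.0]
forecast = [4.0, 4.0, 8.0, 5.0]
errors = [a - f for a, f in zip(actual, forecast)]

print(f'rmse: {rmse(errors)}')
print(f'mae: {mae(errors)}')
```

Because the errors are squared before averaging, the RMSE weights the single large error more heavily than the MAE does.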
Script 19.1: ultimate-calcs.py
########################################################################
# Project X:
# "The Ultimate Question of Life, the Universe, and Everything"
# Project Collaborators: Mr. X, Mrs. Y
#
# Python Script "ultimate-calcs"
# by: F Heiss
# Date of this version: February 18, 2019
########################################################################

# external modules:
import numpy as np
import datetime as dt

# create a time stamp:
ts = dt.datetime.now()

# print to logfile.txt ('w' resets the logfile before writing output)
# in the provided path (make sure that the folder structure
# you may provide already exists):
print(f'This is a log file from: \n{ts}\n',
      file=open('Pyout/19/logfile.txt', 'w'))

# the first calculation using the function "square root" from numpy:
result1 = np.sqrt(1764)

# print to logfile.txt but with keeping the previous results ('a'):
print(f'result1: {result1}\n',
      file=open('Pyout/19/logfile.txt', 'a'))

# the second calculation reverses the first one:
result2 = result1 ** 2

# print to logfile.txt but with keeping the previous results ('a'):
print(f'result2: {result2}',
      file=open('Pyout/19/logfile.txt', 'a'))


Script 19.2: ultimate-calcs2.py 
# external modules:
import numpy as np
import datetime as dt
import sys

# make sure that the folder structure you may provide already exists:
sys.stdout = open('Pyout/19/logfile2.txt', 'w')

# create a time stamp:
ts = dt.datetime.now()

# print to logfile2.txt:
print(f'This is a log file from: \n{ts}\n')

# the first calculation using the function "square root" from numpy:
result1 = np.sqrt(1764)

# print to logfile2.txt:
print(f'result1: {result1}\n')

# the second calculation reverses the first one:
result2 = result1 ** 2

# print to logfile2.txt:
print(f'result2: {result2}')
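
Script 19.2 redirects sys.stdout for the rest of the session and never closes the log file. A possible variation, sketched here as an alternative and not as the approach used in the scripts above, wraps the output in a with block so that the file is closed and the screen output is restored afterwards. The file name logfile3.txt and the temporary folder are assumptions for illustration, and math.sqrt stands in for np.sqrt to keep the sketch free of external modules.

```python
import contextlib
import datetime as dt
import math
import os
import tempfile

# hypothetical log location in a temporary folder:
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, 'logfile3.txt')

# everything printed inside the with block goes to the log file;
# afterwards the file is closed and sys.stdout is restored:
with open(log_path, 'w') as logfile, contextlib.redirect_stdout(logfile):
    ts = dt.datetime.now()
    print(f'This is a log file from: \n{ts}\n')
    result1 = math.sqrt(1764)
    print(f'result1: {result1}\n')
    result2 = result1 ** 2
    print(f'result2: {result2}')

# this line appears on the screen again:
print(f'log written to: {log_path}')
```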


